Abstract

In this paper, a new robust training algorithm is proposed for generating a set of bias-removed,
noise-suppressed reference speech HMM models in an adverse environment suffering from both channel bias and
additive noise. Its main idea is to incorporate a signal bias-compensation operation and a PMC
noise-compensation operation into the iterative training process. This makes the resulting speech HMM models
better suited to a robust speech recognition method that uses the same signal bias-compensation and PMC
noise-compensation operations in the recognition process. Experimental results showed that the speech HMM
models it generated outperformed both the clean-speech HMM models and those generated by the conventional
k-means algorithm on two adverse Mandarin speech recognition tasks. It is therefore a promising robust
training algorithm. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Robust training algorithm; PMC noise-compensation; Signal bias-compensation; Mandarin speech recognition



1. Introduction 

Background noise and channel bias are the two major interference factors that seriously degrade the 
performances of speech recognizers operating in adverse environments such as telephone speech through 
public switching network. Recently, IBM built an HMM-based Mandarin telephone speech recognition 
system using a large telephone speech database called *Mandarin call home database* (Liu et al., 1996). The 
vocabulary contained about 44000 words. The word and syllable error rates were, respectively, 70.5% and
58.7%, which were much worse than those achieved in microphone-speech recognition (Lee and Juang,
1996). In the past, many studies have been devoted to the field of robust speech recognition for adverse 
environment (Juang, 1991; Furui, 1992; Gong, 1995; Junqua and Haton, 1996). Major efforts of those 
studies were put on developing robust recognition algorithms to compensate or to eliminate noise/channel 
effect based on a given set of reference speech models trained usually in clean-speech environment. In the 
non-linear noise subtraction method (Lockwood and Boudy, 1992; Mokbel and Chollet, 1995), a noise
model was first estimated from the non-speech precursor of the testing utterance and then subtracted 
from the speech part in linear spectrum domain in order to obtain noise-suppressed features to be 
recognized using the clean-speech reference models. In (Acero and Stern, 1990, 1991), the CDCN
(codeword-dependent cepstral normalization) algorithm was proposed to estimate equalization vectors for
the best transformation, in the maximum likelihood sense, from the universal codebook into the testing
acoustic space in order to eliminate both the noise and channel effects. In the RASTA method
(Hermansky and Morgan, 1994), a filter was used to eliminate the speaker/channel bias for obtaining
bias-removed recognition features. In the parallel-model-combination (PMC) method (Gales and Young, 1996),
clean-speech HMM models were combined with the current noise model to form noise-compensated
composite HMM models for recognizing noisy speech. In the state-based Wiener filtering method (Hansen
and Clements, 1991; Ephraim, 1992; Vaseghi and Milner, 1997), a two-stage recognition method was used. 
It first used the Viterbi algorithm in the first stage to find the best state sequence for the input testing noisy 
speech, and then applied state-based Wiener filtering to estimate the clean speech, which was recognized using
the clean-speech HMM models in the second stage. In (Zhao, 1996), a two-step procedure was employed to 
detect a spectral bias vector for the input testing utterance by using Gaussian distributed phone models. It 
then removed the estimated bias vector from the testing utterance for recognition. In the stochastic 
matching algorithm (Sankar and Lee, 1996; Lee, 1998), the parameters of mapping functions between the
testing speech and reference HMM models were estimated iteratively using the expectation maximization 
(EM) algorithm (Dempster et al., 1977). In (Minami and Furui, 1996), an integrated method for adapting 
HMM models to additive noise and channel distortion was proposed. This method first estimated the 
signal-to-noise ratio by maximizing the likelihood of the PMC-compensated HMM models to the input 
speech, and then estimated the cepstral bias by Sankar's method (Sankar and Lee, 1996). The procedure
was applied iteratively until convergence was reached.

Apart from the above-mentioned main research stream, the robust training issue is also important for 
adverse speech recognition when the clean-speech reference models are not available. Its main concern is to 
train a set of robust reference speech models directly from a database collected in adverse environment for 
adverse speech recognition. The issue is important because the set of reference speech models obtained by
the conventional segmental k-means algorithm (Juang and Rabiner, 1990) is usually not robust. This is
mainly owing to the high variability in the characteristics of the training speech signals collected in the
adverse environment. For example, a training data set collected from telephone calls through the public
switching network will suffer diverse recording conditions caused by different background noises, different
types of transducers, different telephone channels, etc. This makes speech patterns spread more
widely in the feature space, so that they overlap one another more seriously and the trained speech
models lose discrimination capability.

In the past, many robust training algorithms have been proposed. In the signal bias removal (SBR)
algorithm (Rahim and Juang, 1996), a codebook-based iterative signal bias removal technique was
performed in both the training and testing phases to minimize the channel-induced variations. In
(Anastasakos et al., 1997), the speaker-specific characteristics were first modeled by a linear-regressive
transformation between the speaker-independent models and the speaker-dependent models. A
speaker-adaptive training algorithm based on the EM algorithm was then employed to iteratively estimate
the parameters of the transformation and the compact speaker-normalized HMM models. In (Gong, 1997),
a source normalization training algorithm, which modeled the environmental corruption as a form of linear
transformation, was proposed to estimate the HMM models. The noise and channel effects were modeled
implicitly in the linear transformation. In the testing stage, the MLLR adaptation (Gales and Woodland,
1996) was applied to estimate the state-dependent transformation matrices and the bias terms for
recognition. Those training algorithms have been shown to be effective in removing the channel biases and/or
the speaker variations. However, the noise effect is still seldom considered in the robust training issue.

In this study, we are interested in the robust training issue with both the signal bias and noise effects 
being considered. A robust training algorithm, referred to as the robust environment-effects suppression 
training (REST) algorithm, is proposed. The design goal of the REST algorithm is twofold. One is to 
countervail the large variability of the corrupted training samples for obtaining a set of compact reference 
speech HMM models with both signal bias and noise being suppressed. The other is to make the generated
compact reference speech HMM models better for a given robust speech recognition method. The REST 
algorithm is an iterative training procedure that sequentially optimizes the following three operations: 
parameter estimation for environment characterization, environment-effect compensation for speech seg- 
mentation, and environment-effect suppression for HMM model re-estimation. The parameter estimation 
for environment characterization is to detect the signal bias and to estimate the noise statistics for each 
training utterance. It assumes that each utterance has its own environmental characteristics. Based on an 
assumed environment contamination model, the environment-effect compensation uses the estimated en- 
vironment characterization parameters to adapt the HMM models to match with the current training 
utterance for optimal segmentation. Using the segmentation results and the same environment contami- 
nation model, the environment-effect suppression is to remove the signal bias and the noise out of the 
corrupted speech for updating the HMM models. Owing to the involvement of the environment-effect 
compensation operation in the training process of the REST algorithm, we expect that it will generate 
better reference speech HMM models for the robust recognition method which employs the same
environment-effect compensation operation in the recognition process. This is especially true for the case when
the environment-effect compensation operation is not perfect, due either to the non-existence of a perfect
one or to the use of an inaccurate environment contamination model in its derivation.

The organization of the paper is stated as follows. Section 2 presents the proposed REST algorithm in 
detail. Section 3 describes the robust speech recognition method using the reference speech HMM models 
generated by the REST algorithm. Effectiveness of the REST algorithm is evaluated by simulations
discussed in Section 4. Some conclusions are given in Section 5.



2. The REST algorithm 

The proposed REST training algorithm consists of an iterative procedure which sequentially performs 
the following three steps: 

1. optimally segment each training utterance by using the environment-compensated HMM models,

2. estimate the environment characteristics and enhance the speech by eliminating the noise using the
state-based Wiener filtering method and by removing the signal bias using the SBR method, and

3. re-estimate the speech HMM models. 

Operations performed in these three steps are derived based on a presumed environment contamination 
model. A schematic diagram of the model is displayed in Fig. 1. It assumes that, for each utterance, the
observed speech z is generated from the clean speech x by corrupting first with a convolutional channel b
and then with an additive noise n. Here b is assumed to be time-invariant and n is stationary throughout the 
utterance. In linear spectrum domain, the model can be expressed by 



Fig. 1. A schematic diagram of the environment contamination model. [Diagram: the clean speech x passes through the convolutional channel b; the additive noise n is then added to give the observed speech z.]

$$y_t(f) = b(f) \times x_t(f), \tag{1a}$$

$$z_t(f) = y_t(f) + n_t(f), \tag{1b}$$



where the subscript $t$ denotes the frame index and $y$ is an intermediate signal showing the corruption of the
clean speech with the channel bias only. We can also express the relation of $x$ and $y$ in cepstrum domain by

$$y_t(m) = b(m) + x_t(m), \tag{1c}$$

where $m$ denotes the order of cepstral coefficient. Obviously, it is troublesome to directly estimate the
original clean speech $x$ in either linear spectrum domain or cepstrum domain when both noise interference
and channel distortion exist. As suggested by the above formulations, it is better to deal separately with the
channel distortion in cepstrum domain and the noise interference in linear spectrum domain. In the following
discussions, we specify a signal in linear spectrum domain and in cepstrum domain by attaching to it
the arguments $f$ and $m$, respectively.

The REST algorithm is derived as follows. Assume that the training data set contains $R$ utterances. Let
$\Lambda_e = \{\Lambda_e^{(r)} = (b^{(r)}, \Lambda_n^{(r)})\}_{r=1,\ldots,R}$ denote the set of environmental interference models of the whole
training data set, where $b^{(r)}$ and $\Lambda_n^{(r)} = (\mu_n^{(r)}, \Sigma_n^{(r)})$ are, respectively, the signal bias and the noise model of
the $r$th training utterance; $\mu_n^{(r)}$ and $\Sigma_n^{(r)}$ are the mean vector and covariance matrix of $\Lambda_n^{(r)}$. Let
$Z^{(r)} = \{z_1^{(r)}, \ldots, z_{T_r}^{(r)}\}$ and $X^{(r)} = \{x_1^{(r)}, \ldots, x_{T_r}^{(r)}\}$ be, respectively, the observed and clean-speech feature
vector sequences of the $r$th utterance, and let $\Lambda_x$ denote the set of environment-effect normalized speech
HMM models that we want to generate. Based on the maximum likelihood criterion, the goal of an ideal robust
training algorithm is to jointly estimate $\Lambda_x$ and $\Lambda_e$ with given $\{Z^{(r)}\}_{r=1,\ldots,R}$ by

$$(\Lambda_x, \Lambda_e) = \arg\max_{\Lambda_x, \Lambda_e} \prod_{r=1}^{R} L\big(Z^{(r)} \mid \Lambda_x, \Lambda_e\big), \tag{2}$$

where $L(\cdot)$ is the likelihood function of the observation sequence $Z^{(r)}$ given the parameter set $(\Lambda_x, \Lambda_e)$.
Because it is generally difficult to derive a closed-form solution for the above joint maximization
problem, we use a three-step iterative training procedure in the REST algorithm to
obtain a sub-optimal solution. The three steps are:

1. Form the environment-compensated speech HMM models $\Lambda_z^{(r)}$ by using the current $\{\Lambda_x, \Lambda_e^{(r)}\}$ and use
them to optimally segment the training utterance $Z^{(r)}$.

2. Based on the segmentation result, estimate $\Lambda_n^{(r)}$ and enhance the adverse speech $Z^{(r)}$ to obtain $Y^{(r)}$ by the
state-based Wiener filtering method; then estimate $b^{(r)}$ and further enhance the speech $Y^{(r)}$ to obtain
$X^{(r)}$ by the SBR method.

3. Update the current speech HMM models $\Lambda_x$ using the enhanced speech $\{X^{(r)}\}_{r=1,\ldots,R}$.

We discuss these three steps in more detail as follows.

The first step of the REST algorithm is to optimally segment each training utterance using the current
speech HMM models $\Lambda_{x,k-1}$ and the environmental interference model $\Lambda_{e,k-1}^{(r)}$ given by the previous
iteration, where the subscript $k$ denotes the index of iteration. The task can be accomplished, based on the
maximum likelihood criterion, by solving the following optimization problem to find the best state
sequence $U_k^{(r)} = \{u_{1,k}^{(r)}, \ldots, u_{T_r,k}^{(r)}\}$ and mixture component sequence $V_k^{(r)} = \{v_{1,k}^{(r)}, \ldots, v_{T_r,k}^{(r)}\}$ of the optimal
segmentation:

$$(U_k^{(r)}, V_k^{(r)}) = \arg\max_{U^{(r)}, V^{(r)}} \Pr\big(Z^{(r)}, U^{(r)}, V^{(r)} \mid \Lambda_{x,k-1}, \Lambda_{e,k-1}^{(r)}\big), \tag{3}$$



W.-T. Hong, S.-H. Chen / Speech Communication 30 (2000) 273-293

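where $a_{ij}$ denotes the transition probability from state $i$ to state $j$. Eq. (3) is solved in this study by first
forming the environment-compensated speech HMM models $\Lambda_{z,k-1}^{(r)}$ using $\Lambda_{x,k-1}$ and $\Lambda_{e,k-1}^{(r)}$, and then using
the Viterbi search to simultaneously find $U_k^{(r)}$ and $V_k^{(r)}$. The formation of $\Lambda_{z,k-1}^{(r)}$ from $\Lambda_{x,k-1}$ and $\Lambda_{e,k-1}^{(r)}$ is
based on the assumed environment contamination model defined in Eqs. (1b) and (1c), and realized by the
following two sub-steps:

(1.1) Calculate the bias-compensated models $\Lambda_{y,k-1}^{(r)}$ in cepstrum domain by

$$\mu_{y,jl,k-1}^{(r)}(m) = \mu_{jl,k-1}(m) + b_{k-1}^{(r)}(m), \tag{4a}$$

$$\Sigma_{y,jl,k-1}^{(r)}(m) = \Sigma_{jl,k-1}(m), \tag{4b}$$

where $\mu_{jl,k-1}(m)$ and $\Sigma_{jl,k-1}(m)$ are, respectively, the mean vector and covariance matrix of the $l$th
Gaussian mixture in the $j$th state of $\Lambda_{x,k-1}$, and $b_{k-1}^{(r)}(m)$ is the bias vector given in $\Lambda_{e,k-1}^{(r)}$. Under
Eq. (1c) the bias shifts only the mean, so the covariance is carried over unchanged.

(1.2) Use the PMC method to form $\Lambda_{z,k-1}^{(r)}$ by first transforming $\Lambda_{y,k-1}^{(r)}$ from cepstrum domain to linear
spectrum domain, then combining it with $\Lambda_{n,k-1}^{(r)}$ in linear spectrum domain, and lastly transforming the
result back to cepstrum domain.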
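
Sub-steps (1.1) and (1.2) can be illustrated for a single Gaussian with the following sketch. It is not the implementation used in this study. It assumes full-dimensional cepstra computed with an orthonormal DCT (so the inverse transform is the transpose), the log-normal approximation for converting Gaussians between the log-spectral and linear-spectral domains, and unit gain between speech and noise. All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its inverse is its transpose."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def log_to_linear(mu_log, cov_log):
    """Log-normal moment matching: log-spectral Gaussian -> linear-spectral moments."""
    mu_lin = np.exp(mu_log + 0.5 * np.diag(cov_log))
    cov_lin = np.outer(mu_lin, mu_lin) * (np.exp(cov_log) - 1.0)
    return mu_lin, cov_lin

def linear_to_log(mu_lin, cov_lin):
    """Inverse of log_to_linear under the same log-normal assumption."""
    cov_log = np.log(cov_lin / np.outer(mu_lin, mu_lin) + 1.0)
    mu_log = np.log(mu_lin) - 0.5 * np.diag(cov_log)
    return mu_log, cov_log

def compensate(mu_x_cep, cov_x_cep, bias_cep, mu_n_cep, cov_n_cep):
    """Bias- and noise-compensate one speech Gaussian given one noise Gaussian."""
    C = dct_matrix(mu_x_cep.shape[0])

    # Sub-step (1.1): an additive cepstral bias shifts the mean only.
    mu_y_cep = mu_x_cep + bias_cep
    cov_y_cep = cov_x_cep

    # Sub-step (1.2): cepstrum -> log spectrum -> linear spectrum.
    def cep_to_lin(mu, cov):
        return log_to_linear(C.T @ mu, C.T @ cov @ C)

    mu_y_lin, cov_y_lin = cep_to_lin(mu_y_cep, cov_y_cep)
    mu_n_lin, cov_n_lin = cep_to_lin(mu_n_cep, cov_n_cep)

    # Speech and noise are assumed independent and additive in linear spectrum.
    mu_z_lin = mu_y_lin + mu_n_lin
    cov_z_lin = cov_y_lin + cov_n_lin

    # Back to log spectrum, then to cepstrum.
    mu_z_log, cov_z_log = linear_to_log(mu_z_lin, cov_z_lin)
    return C @ mu_z_log, C @ cov_z_log @ C.T
```

With a negligible noise model the output reduces to the bias-shifted clean Gaussian, which is a convenient sanity check on the domain transforms.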

The second step of the REST algorithm is to enhance the adverse speech by first suppressing the noise
using the state-based Wiener filtering method (Hansen and Clements, 1991; Ephraim, 1992; Vaseghi and
Milner, 1997) and then removing the signal bias by the SBR method (Rahim and Juang, 1996). It consists
of the following two sub-steps:

(2.1) Noise suppression: Given the segmentation information $U_k^{(r)}$, estimate the noise model $\Lambda_{n,k}^{(r)}$ and
eliminate it from the input adverse speech $z_t^{(r)}(f)$, in linear spectrum domain, by the state-based Wiener
filtering method to obtain the intermediate signal $y_{t,k}^{(r)}(f)$. The noise model $\Lambda_{n,k}^{(r)}$ and its average power
spectrum density $P_{n,k}^{(r)}(f)$ of the $r$th utterance are re-estimated from the non-speech frames by

$$\mu_{n,k}^{(r)} = \frac{\sum_{t=1}^{T_r} z_t^{(r)} \times I\big(u_{t,k}^{(r)} \in \text{non-speech}\big)}{\sum_{t=1}^{T_r} I\big(u_{t,k}^{(r)} \in \text{non-speech}\big)}, \tag{5a}$$

$$\Sigma_{n,k}^{(r)} = \frac{\sum_{t=1}^{T_r} \big(z_t^{(r)} - \mu_{n,k}^{(r)}\big)\big(z_t^{(r)} - \mu_{n,k}^{(r)}\big)^{\mathrm T} \times I\big(u_{t,k}^{(r)} \in \text{non-speech}\big)}{\sum_{t=1}^{T_r} I\big(u_{t,k}^{(r)} \in \text{non-speech}\big)}, \tag{5b}$$

$$P_{n,k}^{(r)}(f) = \frac{\sum_{t=1}^{T_r} P_{z,t}^{(r)}(f) \times I\big(u_{t,k}^{(r)} \in \text{non-speech}\big)}{\sum_{t=1}^{T_r} I\big(u_{t,k}^{(r)} \in \text{non-speech}\big)}, \tag{5c}$$

where $P_{z,t}^{(r)}(f)$ is the periodogram of $z_t^{(r)}$, which is defined as

$$P_{z,t}^{(r)}(f) = \frac{1}{L}\,\big|z_t^{(r)}(f)\big|^2, \tag{6}$$

and $L$ is the analysis length of the FFT operation; $I(\cdot)$ is the zero-one indicator function. Based on Eq. (1b)
of the assumed environment contamination model, the Wiener filter for the $j$th state of the speech model and
the $r$th training utterance is constructed and expressed by

$$H_{j,k}^{(r)}(f) = \frac{P_{y_j,k-1}(f)}{P_{y_j,k-1}(f) + P_{n,k}^{(r)}(f)}, \tag{7a}$$

where $P_{y_j,k-1}(f)$ is the average power density spectrum corresponding to the $j$th state of the
bias-compensated speech HMM models. After forming all state-based Wiener filters, we calculate the enhanced
signal by

$$y_{t,k}^{(r)}(f) = H_{u_{t,k}^{(r)},k}^{(r)}(f) \times z_t^{(r)}(f), \quad \text{for } t = 1, \ldots, T_r \text{ and } u_{t,k}^{(r)} \notin \text{non-speech}. \tag{7b}$$

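(2.2) SBR: Given the segmentation information $(U_k^{(r)}, V_k^{(r)})$, estimate the signal bias and remove it
from the intermediate signal $y_{t,k}^{(r)}(f)$ to obtain the environment-normalized speech estimate. The SBR
method is realized by first transforming $y_{t,k}^{(r)}(f)$ to $y_{t,k}^{(r)}(m)$, then making the simplifying assumption that
$\Sigma_{jl}$ is the identity matrix in Eq. (A.11) of Appendix A to obtain

$$b_k^{(r)}(m) = \frac{\sum_{t=1}^{T_r} \big(y_{t,k}^{(r)}(m) - \mu_{u_{t,k}^{(r)} v_{t,k}^{(r)},\,k-1}(m)\big) \times I\big(u_{t,k}^{(r)} \notin \text{non-speech}\big)}{\sum_{t=1}^{T_r} I\big(u_{t,k}^{(r)} \notin \text{non-speech}\big)}, \tag{8a}$$

and lastly removing the signal bias by

$$x_{t,k}^{(r)}(m) = y_{t,k}^{(r)}(m) - b_k^{(r)}(m). \tag{8b}$$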
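
The two enhancement sub-steps can be sketched as follows. This is a minimal illustration rather than the code used in this study. It assumes the standard Wiener gain P_y / (P_y + P_n), leaves non-speech frames unfiltered (the text specifies the filter for speech frames only), and takes the frame-aligned Gaussian means as a precomputed array. All function and argument names are hypothetical.

```python
import numpy as np

def noise_psd(periodograms, nonspeech):
    """Average periodogram over non-speech frames.

    periodograms: (T, F) array; nonspeech: (T,) boolean mask.
    """
    return periodograms[nonspeech].mean(axis=0)

def wiener_gain(p_y_state, p_n):
    """Standard Wiener gain P_y / (P_y + P_n); the exact form is an assumption."""
    return p_y_state / (p_y_state + p_n)

def suppress_noise(z_lin, states, p_y, p_n, nonspeech):
    """Apply the state-dependent gain to speech frames.

    z_lin: (T, F) linear spectra; states: (T,) state indices from the
    segmentation; p_y: (J, F) state-average speech power spectra; p_n: (F,).
    Non-speech frames are copied unchanged (an assumption of this sketch).
    """
    y_lin = z_lin.copy()
    speech = ~nonspeech
    y_lin[speech] = wiener_gain(p_y[states[speech]], p_n) * z_lin[speech]
    return y_lin

def estimate_bias(y_cep, aligned_means, nonspeech):
    """Mean cepstral residual between speech frames and their aligned means."""
    speech = ~nonspeech
    return (y_cep[speech] - aligned_means[speech]).mean(axis=0)

def remove_bias(y_cep, bias):
    """Subtract the utterance-level bias from every cepstral frame."""
    return y_cep - bias
```

Because the bias estimate is an average residual over speech frames, it recovers a constant cepstral offset exactly when the frames coincide with their aligned means plus that offset.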

The third step of the REST algorithm is to re-estimate the speech HMM models $\Lambda_{x,k}$ and the average
power density spectra $\{P_{y_j,k}(f)\}_{j=1,\ldots,N_j}$ using, respectively, the enhanced speech signals $\{X_k^{(r)}(m)\}_{r=1,\ldots,R}$
and $\{Y_k^{(r)}(f)\}_{r=1,\ldots,R}$, based on the current segmentation information $\{(U_k^{(r)}, V_k^{(r)})\}_{r=1,\ldots,R}$, where $N_j$ denotes
the total number of states in the HMM models.

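The combination of all operations in the above three steps can be interpreted as a sequential optimal
estimation procedure, listed in the following:

For iteration k

For utterance r = 1 to R, do

$$(U_k^{(r)}, V_k^{(r)}) = \arg\max_{U^{(r)}, V^{(r)}} \Pr\big(Z^{(r)}, U^{(r)}, V^{(r)} \mid \Lambda_{x,k-1}, \Lambda_{e,k-1}^{(r)}\big), \tag{9a}$$

$$\Lambda_{n,k}^{(r)} = \arg\max_{\Lambda_n^{(r)}} \Pr\big(Z^{(r)} \mid \Lambda_n^{(r)}, U_k^{(r)}\big), \tag{9b}$$

$$Y_k^{(r)} = \arg\max_{Y^{(r)}} \Pr\big(Y^{(r)} \mid Z^{(r)}, U_k^{(r)}, \Lambda_{n,k}^{(r)}, \{P_{y_j,k-1}\}_{j=1,\ldots,N_j}\big), \tag{9c}$$

$$b_k^{(r)} = \arg\max_{b^{(r)}} \Pr\big(Y_k^{(r)} \mid b^{(r)}, U_k^{(r)}, V_k^{(r)}, \Lambda_{x,k-1}\big), \tag{9d}$$

$$X_k^{(r)} = \arg\max_{X^{(r)}} \Pr\big(X^{(r)} \mid Y_k^{(r)}, b_k^{(r)}\big). \tag{9e}$$

End loop for r

$$\Lambda_{x,k} = \arg\max_{\Lambda_x} \Pr\big(\{X_k^{(r)}\}_{r=1,\ldots,R} \mid \Lambda_x, \{(U_k^{(r)}, V_k^{(r)})\}_{r=1,\ldots,R}\big). \tag{9f}$$

Repeat for k until the average likelihood score converges.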
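
The control flow of this procedure can be summarized by the skeleton below. It is a sketch only: the three callables `segment`, `enhance` and `reestimate` are hypothetical placeholders for the segmentation of Eq. (9a), the per-utterance estimation and enhancement of Eqs. (9b)-(9e), and the final model update. The stopping tolerance `tol` and the iteration cap `max_iter` are assumptions, not values from this study.

```python
def rest_train(utterances, models, env, segment, enhance, reestimate,
               max_iter=20, tol=1e-4):
    """Iterate segmentation, enhancement and re-estimation until the average
    log-likelihood score stops changing by more than `tol`.

    utterances: list of observed utterances Z^(r)
    models:     current speech models
    env:        list of per-utterance environment parameters
    """
    history = []
    previous = None
    for _ in range(max_iter):
        enhanced = []
        total_score = 0.0
        for r, z in enumerate(utterances):
            # Step 1: segment with the environment-compensated models.
            segmentation, score = segment(z, models, env[r])
            # Step 2: update the environment parameters and enhance the speech.
            env[r], x = enhance(z, segmentation, models, env[r])
            enhanced.append((x, segmentation))
            total_score += score
        # Step 3: re-estimate the speech models from all enhanced utterances.
        models = reestimate(enhanced)
        average = total_score / len(utterances)
        history.append(average)
        if previous is not None and abs(average - previous) < tol:
            break
        previous = average
    return models, env, history
```

Passing the three steps in as callables keeps the loop independent of any particular segmentation or enhancement method, which is why the same skeleton also covers the simplified noise-only variant of the algorithm.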
A similar idea was used in (Lim and Oppenheim, 1978; Hansen and Clements, 1991), where a sequential
MAP estimation procedure was employed in an iterative algorithm to sequentially estimate the linear prediction
coefficients, gain, and the noise-free speech waveform for frame-level speech enhancement.

The REST algorithm can also be derived by using the EM algorithm (Dempster et al., 1977), so its
convergence can be guaranteed. Detailed derivations of the EM procedure are given in
Appendix A.

Like other iterative algorithms, the REST algorithm must be initialized by giving an initial set of speech 
HMM models, an initial set of state averaged power density spectra, an initial channel bias vector, and an 
initial noise model. The initial speech HMM models and the initial state averaged power density spectra can 
be constructed by a conventional ML training algorithm using either an enhanced version of the given 
adverse-speech training set or another training set with high SNR. In the study, we adopt the former 
approach to use an enhanced speech training set obtained by subtracting the given initial noise model from 
the adverse-speech training set. The initial noise models are obtained from non-speech frames of the ad- 
verse-speech training set detected by an RNN-based speech segmentation method (Hong and Chen, 1997). 
It uses an RNN classifier, directly trained from adverse speech, to classify the input speech pattern into 
three broad-classes: initial, final and non-speech. The speech segmentation method has been shown to
perform well in noisy environment (Hong et al., 1999). The initial bias vector is obtained by the SBR 
method using the above enhanced speech training set. 



3. The PMC-SBC method for Mandarin base-syllable recognition

Mandarin Chinese is a tonal language. Each Chinese character is pronounced as a syllable with a tone. 
There are, in total, about 1300 syllables. If the tones are disregarded, there are only 411 phonologically 
allowed base-syllables. The phonetic structures of these 411 base-syllables are very regular and relatively 
simple as compared with English. A base-syllable can be decomposed into an optional initial and a final. 
There are in total 22 initials (including a null) and 39 finals. Although the base-syllable set is only of
medium size, its recognition is actually very difficult because it comprises many highly confusable sets.
Specifically, all 411 base-syllables can be categorized into 39 confusable sets according to their finals. Like
the English E-set, all base-syllables in each confusable set differ only in their initial consonants and are
therefore difficult to distinguish (Chang et al., 1993; Lee and Juang, 1996). Besides, cross-set confusion
between these 39 sets also occurs easily. Medial confusion and nasal-ending confusion are the two most
common types of cross-set confusion. Highly discriminative speech models are therefore needed
to tackle the difficult task. In this study, a set of sub-syllable HMM models containing 100 3-state
right-final-dependent initial models and 39 5-state context-independent final models is used as basic recognition
units (Wang and Chen, 1998). In each state, a mixture Gaussian distribution with diagonal covariance
matrices is used. The number of mixtures in each state is variable and depends on the number of training
samples, but a fixed maximum value is set for it. Besides, a single-state, single-mixture, utterance-dependent 
model is used for noise. 

An integrated PMC-based Mandarin base-syllable recognition method, which is a modified version of 
the PMC method for additive and convolutional noise (Gales and Young, 1995; Nakamura et al., 1996) by 
additionally considering broad-class based likelihood compensation (Hong and Chen, 1997), is employed 
in this work to test the reference speech HMM models generated by the proposed REST training algo- 
rithm. It can be regarded as the combination of the PMC method and a signal bias compensation (SBC) 
method and is referred to as the PMC-SBC method. A block diagram of the new recognizer is displayed in 
Fig. 2. Each input testing utterance is first processed in the RNN-based Speech Segmentation (Hong and 
Chen, 1997) to detect non-speech frames. The RNN-based speech segmentation uses a three-layer simple 
RNN to discriminate each input frame among three broad-classes of initial, final and non-speech.

Fig. 2. A block diagram of the PMC-SBC method for testing the REST algorithm.

Non-speech frames are then detected by comparing the RNN non-speech output with a pre-determined 
threshold and used in the noise estimations to estimate the noise model. The input utterance is then 
processed in the Noise Subtraction and Signal Bias Estimation by first subtracting the noise model estimate 
to obtain an enhanced speech and then transforming to cepstrum domain to estimate the signal bias by the
SBR method (Rahim and Juang, 1996). The SBR method estimates the signal bias by first encoding the 
feature vectors of the enhanced speech using a codebook and then calculating the average encoding re- 
siduals. The codebook is formed by collecting the mean vectors of mixture components of all reference 
speech HMM models. The bias estimate is then used in the Bias Compensation to convert all reference 
speech HMM models into bias-compensated speech HMM models. These models are then further
converted, in the PMC Noise Compensation, into noise- and bias-compensated speech HMM models using the
above noise model estimate. The PMC noise-compensation method used adopts the log-normal approx- 
imation (Gales and Young, 1993) for its noise-combination operator. These noise- and bias-compensated 
speech HMM models are then used in the One-stage DP Search to generate the recognized base-syllable 
sequence for the input adverse testing utterance. The One-stage DP Search uses a Viterbi search algorithm 
invoking with cumulative bounded-state-duration constraints (Wang and Chen, 1998) to accomplish its 
task with the help of the Likelihood Compensation. The likelihood compensation (LC) scheme used is the 
one proposed previously for improving the PMC-based recognition method for noisy Mandarin speech 
(Hong and Chen, 1997; Hong et al., 1999). The LC scheme uses the broad-class classification information, 
provided by the RNN outputs, to help reduce the recognition errors caused by the misalignments of 
syllable boundaries. Due to its importance, the LC scheme is briefly discussed as follows. Although the
PMC method is effective in adapting the clean-speech HMM models to match the testing noise
environment, the discrimination capabilities of the noise-compensated HMM models are still
degraded by the noise perturbation on the distributions of the recognition features of speech
patterns. This noise-perturbation effect makes all speech phones more difficult to distinguish not
only from each other but also from the background noise. The PMC method can do nothing to compensate
for this effect. This noise-induced confusion was also confirmed in a study by Junqua et al. (1994)
on a simple 10-digit noisy speech recognition task. They found that a large portion of recognition errors was
owing to word boundary misalignments caused by confusion between speech signals and the
background noise. To partially cure this weakness of the PMC method, the LC scheme uses the broad-class
classification information provided by the RNN to assist in the recognition. It directly takes the three
RNN outputs as weighting factors to add additional scores to the log-likelihood scores of HMM states
associated with the three broad classes, i.e.,

$$\tilde{p}_j(z_t) = \begin{cases} p_j(z_t) + \alpha \log(W_I(t)), & j \in \text{initial}, \\ p_j(z_t) + \alpha \log(W_F(t)), & j \in \text{final}, \\ p_j(z_t) + \alpha \log(W_N(t)), & j \in \text{non-speech}, \end{cases} \tag{10}$$

where $W_I(t)$, $W_F(t)$ and $W_N(t)$ are the initial, final and non-speech outputs of the RNN, $p_j(z_t)$ is the
log-likelihood score of state $j$, and $\alpha$ is a scaling factor to control the degree of the likelihood compensation. It is
noted that, if hard decisions are performed in the broad-class classification to make $W_I(t)$, $W_F(t)$ and $W_N(t)$
become 0-1 functions, the LC scheme is equivalent to a restricted recognition search scheme in which only
sub-syllables belonging to the detected broad-class need to be considered.
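
Eq. (10) amounts to adding a class-dependent penalty to each state's frame score, as the following sketch shows. It is an illustration under stated assumptions: the class coding 0 = initial, 1 = final, 2 = non-speech, the function name, and the small floor that keeps the logarithm finite for hard 0-1 outputs are all choices made here, not part of the original method.

```python
import numpy as np

def compensate_log_likelihood(log_lik, state_class, rnn_out, alpha=1.0, floor=1e-10):
    """Add alpha * log(W_class(t)) to the log-likelihood of every state.

    log_lik:     (T, J) frame-by-state log-likelihood scores p_j(z_t)
    state_class: (J,) broad class of each state (0 initial, 1 final, 2 non-speech)
    rnn_out:     (T, 3) RNN outputs W_I(t), W_F(t), W_N(t)
    floor:       lower bound on the RNN outputs (assumption; avoids log(0))
    """
    log_weights = np.log(np.maximum(rnn_out, floor))   # shape (T, 3)
    return log_lik + alpha * log_weights[:, state_class]
```

With hard 0-1 RNN outputs, states outside the detected class receive a very large penalty, which reproduces the restricted-search behaviour noted above.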



4. Evaluation 

Performance of the proposed REST algorithm was evaluated on two multi-speaker Mandarin base- 
syllable recognition tasks. Because previous studies on robust training for eliminating the
noise effect were still very few, we examined the effectiveness of the REST training algorithm in eliminating
the noise effect in detail in the first task. Both the REST training algorithm and the PMC-SBC recognition
method were simplified by discarding the parts related to the signal bias compensation. In the second task, 
the complete function of the REST algorithm on eliminating both the signal bias and noise effects was 
examined. In the following experiments, the base-syllable accuracy rate defined below was used to evaluate 
the recognition performance: 

$$\text{base-syllable accuracy rate} = \left(1 - \frac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{number of testing base-syllables}}\right) \times 100\,(\%), \tag{11}$$

where Subs, Dels and Ins denoted the numbers of substitution, deletion and insertion errors, respectively. 
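
As a worked illustration of Eq. (11), the sketch below counts the three error types with a minimum-edit-distance alignment and then evaluates the accuracy rate. The alignment routine and its tie-breaking rule are assumptions of this sketch, since the scoring tool used in the experiments is not specified, and the syllable strings in the usage note are made up.

```python
def count_errors(reference, hypothesis):
    """Return (subs, dels, ins) from a minimum-edit-distance alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # Each cell holds (total errors, subs, dels, ins) for the two prefixes.
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            total, subs, dels, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                best = (total, subs, dels, ins)            # match
            else:
                best = (total + 1, subs + 1, dels, ins)    # substitution
            total, subs, dels, ins = cost[i - 1][j]
            best = min(best, (total + 1, subs, dels + 1, ins))   # deletion
            total, subs, dels, ins = cost[i][j - 1]
            best = min(best, (total + 1, subs, dels, ins + 1))   # insertion
            cost[i][j] = best
    return cost[-1][-1][1:]

def accuracy_rate(subs, dels, ins, n_reference):
    """Base-syllable accuracy rate of Eq. (11), in percent."""
    return 100.0 * (1.0 - (subs + dels + ins) / n_reference)
```

For a reference `["ba", "ma", "ta", "la"]` and a hypothesis `["ba", "na", "la"]`, the alignment yields one substitution and one deletion, so the accuracy rate is 50%. Because insertions are counted, the rate can become negative for very poor hypotheses.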

4.1. Performance evaluation I

In the first task, the performance of the REST algorithm on the adverse environment with only additive 
noise interference was examined. The noisy speech databases used in this study were generated by artificially 
adding noises to a clean-speech database composed of 1200 utterances of four speakers including two males
and two females. Each utterance comprised several syllables and was pronounced in such a way that every 
syllable was clearly pronounced. The database contained in total 6197 syllables including 5124 training 
syllables and 1073 testing syllables. All speech signals were digitally recorded in a laboratory using a PC with 
a 16-bit Sound Blaster card and a head-set microphone. A sampling rate of 16 kHz was used. Two noisy- 
speech databases were artificially generated from the clean-speech database by adding noises of two different 
types including the Lynx helicopter noise from NOISEX-92 (Varga, 1993) and a computer-generated white 
Gaussian noise. For simplicity, these two noise types are referred to as Lynx and White noises, respectively. 
For each noise type, the training database contained three noisy-speech data sets of 12, 24 and 36 dB in SNR.
The open test used another three data sets for each noise type with 9, 18 and 30 dB in SNR. All speech signals
were first pre-processed into 20 ms Hamming-windowed frames with a 10 ms shift. Then, a set of 25
recognition features including 12 MFCC, 12 delta MFCC and a delta log-energy was computed for each 
frame. The maximum number of mixture components in each HMM state was set to be 5. 
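
The framing described above can be sketched as follows. At a 16 kHz sampling rate, a 20 ms frame holds 320 samples and a 10 ms shift is 160 samples. The function names and the FFT length of 512 are assumptions made here; the MFCC computation itself is not shown.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split a waveform into overlapping Hamming-windowed frames."""
    signal = np.asarray(signal, dtype=float)
    frame_len = sample_rate * frame_ms // 1000    # 320 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000        # 160 samples at 16 kHz
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row t holds the sample indices of frame t.
    index = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[index] * np.hamming(frame_len)

def periodogram(frames, n_fft=512):
    """Periodogram |FFT|^2 / L of each frame, with L = n_fft (assumed length)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2 / n_fft
```

A one-second signal therefore produces 99 frames, and each frame contributes one 25-dimensional feature vector (12 MFCC, 12 delta MFCC and one delta log-energy).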

We first examined the efficiency of the speech HMM models generated by the REST algorithm using the
F-ratio measure (Nicholson et al., 1997). The F-ratio is a measure of class separability in the acoustic
feature space and can be roughly defined by



$$\text{F-ratio} = \frac{\text{variance of means}}{\text{mean of variances}}. \tag{12}$$



In this test, the classes were defined to include all states of the speech HMM models. The variance of means 
is the sample variance of all state means of these HMM models, and the mean of variances is the sample mean 
of all state variances. Obviously, a larger F-ratio measure indicates a larger separation among the states of 
the speech HMM models, which in turn roughly indicates that they have a higher discrimination capability. In 
the study, two schemes of the REST training algorithm with two different sets of initial models were tested. 
The first set of initial models, denoted as INIT1, was formed by the clean-speech HMM models, clean-speech
state average power density spectra, and the exact noise models. Since INIT1 was an ideal model, the first
scheme was not practical and hence was taken for reference only. The other set of initial models, denoted as
INIT2, was a practical one and was generated by firstly segmenting all training utterances by the RNN-based 
speech segmentation method (Hong and Chen, 1997), secondly estimating the initial utterance-dependent 
noise models from non-speech frames of those training utterances, and lastly estimating the initial speech 
HMM models and the initial state average power density spectra from the enhanced version of the original 
training set obtained by subtracting the initial noise model. Figs. 3 and 4 show the feature-based F-ratio 
measures of the resulting HMM models for the two cases using Lynx and White noises, respectively. It can be 
seen from these two figures that the F-ratio measures for both schemes of the REST algorithm with INIT1 and
INIT2 are comparable and are all better than those of the HMMb models (to be defined later) trained by the
conventional k-means algorithm. This is especially true for the lower-order recognition features. So the speech
HMM models generated by the proposed REST algorithm are more compact and hence expected to possess 
better discrimination capability. Fig. 5 shows the learning curve of the REST algorithm. It can be found from 
Fig. 5 that the average log-likelihood score increases monotonically with respect to the iteration number. This 
empirically shows the convergence of the REST algorithm. 
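
The F-ratio measure discussed above can be illustrated with a minimal Python sketch. It assumes that the per-state means and variances of one recognition feature have already been collected from the HMM models; the function name `f_ratio` and the toy numbers are hypothetical and are not taken from the experiments.

```python
from statistics import mean, variance


def f_ratio(state_means, state_variances):
    """F-ratio of one feature: sample variance of the state means
    divided by the sample mean of the state variances."""
    return variance(state_means) / mean(state_variances)


# Toy values (made up for illustration only).
well_separated = f_ratio([0.0, 2.0, 4.0], [1.0, 1.0, 2.0])
poorly_separated = f_ratio([1.9, 2.0, 2.1], [1.0, 1.0, 2.0])
print(well_separated, poorly_separated)
```

With the same state variances, the set of states whose means are farther apart yields the larger F-ratio, which is the sense in which a larger value suggests better separation.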



Fig. 3. The F-ratio measures of the speech HMM models trained from the noisy speech training database corrupted with Lynx noise (curves: REST(INIT1), REST(INIT2) and HMMb). 
Fig. 4. The F-ratio measures of the speech HMM models trained from the noisy speech training database corrupted with White noise. 
Fig. 5. The learning curve of the REST algorithm for the first task. 



We then examined the recognition performance of the speech HMM models generated by the REST 
algorithm. The performance of the HMM method when both training and testing data were clean speech 
was also tested and taken as a benchmark. Its base-syllable recognition rate was 80.5%. In this test, four sets 
of reference speech HMM models were compared. They included: 

M1. HMMc: The HMM models trained from the clean-speech database by the ML-based segmental k-means 
    algorithm. 
M2. HMMb: The HMM models trained from the noisy-speech database with three different SNRs by the 
    ML-based segmental k-means algorithm. 
M3. HMMr: The HMM models trained from the noisy-speech database with three different SNRs by the 
    proposed REST algorithm. 
M4. HMMm: The HMM models trained from a noisy-speech data set with SNR matched with the testing 
    speech by the ML-based segmental k-means algorithm. That is, the HMM models trained from the 9, 18 
    or 30 dB noisy-speech data set were used to recognize noisy speech with the same SNR. 

For comparing the performances of these four sets of reference speech HMM models on noisy speech 
recognition, the following three recognition schemes were used: 

S1-1. The 'NC' scheme: The conventional HMM recognition method without noise compensation. 
S1-2. The 'PMC' scheme: The conventional PMC method (Gales and Young, 1993) with the noise model 
      being estimated based on RNN-based speech segmentation. Its noise-compensation operation used 
      the log-normal approximation. 
S1-3. The 'PMC/LC' scheme: An extended version of the PMC scheme that additionally invokes the likelihood 
      compensation scheme. It is a degenerate version of the PMC-SBC method discussed in Section 3 
      with the parts related to signal-bias compensation being discarded. 

Tables 1 and 2 show the experimental results of the open tests for the two cases using Lynx and White 
noises, respectively. It is noted that, in the implementation of the PMC recognition method using HMMb as 
reference models, the mean of the estimated noise model, $\hat{\mu}_n(f)$, was intuitively modified by 

    $$\hat{\mu}_n(f) \leftarrow \begin{cases} \text{[first case illegible in the source]}, \\ 0, & \text{otherwise}, \end{cases}$$

to account for the noise effect embedded in the HMMb models. Here $\bar{\mu}_n(f)$ is the noise mean of the training 
database estimated in the training process of generating the HMMb models. From Tables 1 and 2, the 
following observations can be found: 

O1. For HMMb, the NC scheme performed fairly for both noise types with SNR = 18 dB and SNR = 30 
dB. But it performed very badly for both noise types with SNR = 9 dB. 

O2. For HMMm, the NC scheme performed very well for both noise types with all the three SNRs. 

O3. For HMMb, the PMC scheme performed only slightly better than the NC scheme for both noise types 
with SNR = 18 dB and SNR = 30 dB, and much better for SNR = 9 dB. 

O4. The NC scheme with HMMm performed better than the PMC scheme with HMMc for both noise 
types with all the three SNRs. 



Table 1 
The recognition results of the open tests for noisy speech corrupted with Lynx noise (unit: %) 

            HMMb              HMMc              HMMr              HMMm
SNR (dB)    NC       PMC      PMC      PMC/LC   PMC      PMC/LC   NC
 9          -12.1    34.9     39.1     42.3     43.6     48.7     45.0
18           51.4    52.0     58.6     62.5     62.8     67.7     66.3
30           62.3    65.1     71.2     75.1     73.6     78.3     75.6
Table 2 
The recognition results of the open tests for noisy speech corrupted with White noise (unit: %) 

            HMMb              HMMc              HMMr              HMMm
SNR (dB)    NC       PMC      PMC      PMC/LC   PMC      PMC/LC   NC
 9          -35.9    29.8     26.9     33.0     35.0     38.1     33.6
18           42.8    45.2     48.3     52.0     54.2     58.0     57.0
30           58.4    59.9     65.4     71.8     68.2     73.8     68.6

O5. For both PMC and PMC/LC schemes, HMMr performed better than HMMc for both noise types 
with all the three SNRs. 

O6. For both HMMr and HMMc, the PMC/LC scheme performed much better than the PMC scheme. 

O7. The PMC/LC scheme with HMMr performed better than the NC scheme with HMMm. 

Based on these observations, the following conclusions can be drawn: 

C1-1. From O1-O2, the conventional HMM method without noise compensation can be used in noisy 
      speech recognition only when the noise level of the training data set is the same as that of the 
      testing speech. If the training database contains noisy speech with diverse noise levels, its 
      performance will degrade seriously. 
C1-2. From O1-O3, the HMM models generated by the conventional k-means training algorithm are 
      good for the NC scheme in the noise-level matched condition, fair in the noise-level interpolation 
      condition, and bad in the noise-level extrapolation condition. 
C1-3. From O1 and O3, the performance improvements for the HMM method using HMMb reference 
      models by the PMC noise compensation are very limited. 
C1-4. From O4, the log-normal approximation of the noise-compensation operation used in the PMC 
      scheme is not perfect. 
C1-5. From O5 and C1-4, the REST algorithm is a very efficient training algorithm to generate noise- 
      suppressed HMM models directly from a noisy speech database with diverse noise levels. The 
      resulting HMM models perform very well in the PMC scheme for testing noisy speech with 
      untrained noise levels. They are even better than the clean-speech HMM models for the PMC 
      method when the noise-compensation operation is not perfect. So it is a very promising robust 
      training algorithm. 
C1-6. From O6-O7, the likelihood compensation scheme is very helpful for the PMC-based noisy speech 
      recognition. Actually, the PMC/LC scheme using HMMr reference models performed best in all 
      cases of the test. 

An extra test on noisy English digit recognition using the NOISEX-92 database (Varga and 
Steeneken, 1993) was performed to examine the validity of the proposed REST algorithm. The database 
contains utterances of isolated digits and digit triples uttered by one male and one female speaker. 
Here only the isolated-digit utterances were used. The database contains in total 400 digits 
including 200 training tokens and 200 testing tokens. Each testing utterance comprises 100 digits and 
was uttered in such a way that every digit was clearly pronounced. All speech signals were first 
pre-processed in 25 ms Hamming-windowed frames with a 10 ms shift. Then, 12 MFCCs were 
computed for each frame and taken as the recognition features. For each digit, an 8-state HMM model 
with observations in each state being modeled by a mixture Gaussian distribution was trained. The 
number of mixture components in each state was set to 2. Besides, a single-state, single-mixture 
model was used for noise. 
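
The framing step described above (25 ms Hamming-windowed frames with a 10 ms shift) can be sketched as follows. This is only an illustration under stated assumptions: the sampling rate of 8 kHz, the function name `hamming_frames` and the constant test signal are hypothetical, and the subsequent MFCC computation is not shown.

```python
import math


def hamming_frames(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Cut a signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    # Standard Hamming window: 0.54 - 0.46 * cos(2*pi*n / (L - 1)).
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# One second of a constant signal at an assumed 8 kHz sampling rate.
frames = hamming_frames([1.0] * 8000, sample_rate=8000)
print(len(frames), len(frames[0]))
```

Each returned frame would then be passed to a feature extractor to obtain the 12 MFCCs mentioned in the text.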

In the test, we considered the performance of the REST algorithm on the adverse environment with 
additive noise interference only. Noisy-speech databases were artificially generated from the clean-speech 
database by adding computer-generated white Gaussian noise. The noisy training database contained four 
data sets of 0, 6, 12 and 24 dB in SNR. The open test used another five data sets of -3, 0, 3, 9 and 18 dB in 
SNR. The same accuracy rate defined in Eq. (11) was used to evaluate the recognition performance. We 
note that the benchmark of the recognition performance achieved by the conventional ML-trained HMM 
method for the clean-speech case is 100%. Three recognition schemes used in the first test were compared. 
They included: 

1. HMMb-NC: The conventional HMM method without noise compensation using HMM models trained 
from noisy speech. 

2. HMMc-PMC: The PMC method using clean-speech HMM models. 

3. HMMr-PMC: The PMC method using HMM models trained by the REST algorithm. 

Table 3 
The recognition results of the NOISEX-92 database corrupted by White noise (unit: %) 

SNR (dB)    HMMb-NC    HMMc-PMC    HMMr-PMC
-3          39.5       72.0        82.5
 0          63.5       86.0        94.5
 3          82.0       94.0        99.0
 9          93.0       98.5        99.5
18          94.0       99.5        99.5

Table 3 shows the experimental results. It can be seen from the table that HMMb-NC performed the worst, 
HMMc-PMC the next, and HMMr-PMC the best. This result is consistent with what we have obtained in 
the first test of the study on adverse Mandarin speech recognition. 

4.2. Performance evaluation II 

In the second task, the performance of the REST algorithm in an adverse environment with both 
channel bias and noise interference was examined. A simulated telephone-speech database generated by 
corrupting a clean-speech database with both convolutional channel bias and additive white noise was 
used in this study. The clean-speech database was generated by 10 speakers including 8 males and 2 
females. It was a super-set of the clean-speech database used in the first task with the same recording 
condition. It contained, in total, 3050 utterances including 2572 training utterances (12800 syllables) 
and 478 testing utterances (2666 syllables). To generate the adverse-speech database, each clean-speech 
utterance was first corrupted by a computer-generated white Gaussian noise and then passed through a 
filter which simulated a telephone channel. This was realized simply by first adding the white noise in 
the time domain and then adding the channel bias in the frequency domain. It is noted that the assumed 
environment contamination model shown in Fig. 1 is still suitable for modeling the simulated database. 
In the training database generation, noises with levels of 12, 24 and 36 dB in SNR were separately 
added to three subsets of the clean-speech training database. These three subsets contained utterances 
of three, three and four speakers, respectively. In the testing database generation, noises with levels of 
9, 18 and 30 dB in SNR were added to the whole clean-speech testing database. To simulate the 
channel variations on the telephone speech through the public switching network, a set of 227 simulated 
filters was generated from a large telephone-speech database provided by Chunghwa Telecommunication 
Laboratories. Each filter was obtained by performing a frame-based cepstrum average over 
the long utterance of a telephone call through the public switching network. Fig. 6 shows their frequency 
responses. Among these 227 channel filters, 195 were used to generate the training database 
while all others were used in the testing database generation. It is noted that the stationarity of the 
environment characteristics for each utterance is guaranteed in this simulated adverse-speech database 
via the use of an utterance-dependent channel filter and noise level. 
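
The noise-corruption step described above (adding computer-generated white Gaussian noise at a prescribed SNR) can be sketched as below. This is a minimal illustration, not the procedure actually used to build the database: the helper names, the fixed random seed and the sine-wave stand-in for a clean utterance are all assumptions, and the telephone-channel filtering stage is not modeled.

```python
import math
import random


def mean_power(samples):
    """Average power (mean squared amplitude) of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_white_noise(clean, snr_db, seed=0):
    """Return (noisy, noise), where the white Gaussian noise is scaled so that
    10*log10(signal power / noise power) equals snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # Gain g chosen so that mean_power(clean) / (g**2 * mean_power(noise))
    # equals 10**(snr_db / 10).
    gain = math.sqrt(mean_power(clean) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(clean, noise)]
    return noisy, noise


def measured_snr_db(clean, noise):
    """SNR in dB computed from the clean signal and the added noise."""
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))


# Toy "utterance": a sine wave standing in for clean speech.
clean = [math.sin(0.05 * t) for t in range(2000)]
for target in (12, 24, 36):  # training SNR levels quoted in the text
    noisy, noise = add_white_noise(clean, target)
    print(target, round(measured_snr_db(clean, noise), 6))
```

Because the noise is rescaled from its measured power, the resulting SNR matches the target for each utterance individually, which mirrors the utterance-dependent noise level mentioned in the text.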

The same format of speech HMM models as in the first task was used here. The only difference was that the 
maximal number of mixtures used in each HMM state was increased to 20. In the REST algorithm, the 
initial condition was generated from the same adverse training database by a four-step procedure. First, 
segment all training utterances by the RNN-based speech segmentation method (Hong and Chen, 1997). 
Second, estimate the initial utterance-dependent noise model from the non-speech frames of each training 
utterance. Third, estimate the initial speech HMM models and the initial state average power density 
spectra from the enhanced version of the original training set obtained by subtracting the initial noise 
models. Last, estimate the initial channel bias vectors from the same enhanced training set by the SBR 
method (Rahim and Juang, 1996). 

Fig. 6. The frequency responses of the simulated telephone channels. 

In this test, the following recognition schemes were compared: 

S2-1. The 'BASELINE' scheme: The conventional HMM method using the reference speech models 
      trained directly from the adverse training database by the segmental k-means algorithm. 
S2-2. The 'CLEAN' scheme: The PMC-SBC recognition method using the clean-speech reference HMM 
      models, but without invoking the LC scheme. 
S2-3. The 'REST-bias' scheme: The SBC recognition method using the reference HMM models trained by 
      the REST algorithm without considering noise suppression. 
S2-4. The 'REST-noise' scheme: The PMC recognition method using the reference HMM models trained 
      by the REST algorithm without considering signal bias removal. 
S2-5. The 'REST' scheme: The PMC-SBC recognition method using the REST-trained reference HMM 
      models, but without invoking the LC scheme. 
S2-6. The 'REST/LC' scheme: The PMC-SBC recognition method using the REST-trained reference HMM 
      models. 

Table 4 shows the base-syllable recognition results of these six schemes for adverse speech corrupted with 
channel bias and White noise. It can be found from Table 4 that, according to the recognition rate, these six 
schemes can be ordered as: REST/LC > REST > REST-noise or REST-bias > BASELINE > CLEAN. 
Based on the experimental results, the following conclusions can be made: 



Table 4 
The recognition results of the open tests for adverse speech corrupted with channel bias and White noise (unit: %) 

SNR (dB)    BASELINE    CLEAN    REST-bias    REST-noise    REST    REST/LC
 9          23.4        14.8     24.5         29.3          33.0    35.2
18          46.7        27.3     50.2         48.4          53.7    56.5
30          60.2        45.6     62.7         61.8          65.5    66.7

C2-1. The conventional HMM method using the reference models trained by the k-means algorithm 
      performed fairly in adverse speech recognition. 

C2-2. The result that the CLEAN scheme performed much worse than the BASELINE scheme is mainly 
      owing to the imperfection of the channel bias compensation performed in the SBC method. 
      Actually, the CLEAN scheme totally failed to compensate the mismatch between the testing 
      speech and the clean-speech HMM model. This primarily resulted from the large deviation of the 
      estimated signal bias from the real channel bias. 

C2-3. Although the channel bias-compensation operation of the SBC method is imperfect, the REST 
      training algorithm can still take advantage of it by embedding it into the iterative training process to 
      make the resulting HMM models more suitable to be used with the channel bias compensation of 
      the testing process. This has been confirmed by the fact that both the REST-bias and REST schemes 
      performed better than the BASELINE scheme. 

C2-4. The HMM models generated by the REST algorithm which considers both noise suppression and 
      signal bias removal are better than those obtained by the REST algorithm considering only noise 
      suppression or only signal bias removal. 

C2-5. The likelihood compensation scheme is still effective in assisting the adverse speech 
      recognition. 
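
As a quick arithmetic check of the ordering of the six schemes stated before Table 4, the following Python sketch averages the recognition rates of each scheme over the three SNRs and sorts them. The numbers are copied from Table 4; ranking by the plain average over SNRs is an assumption made here for illustration and is not a procedure described in the paper.

```python
# Base-syllable recognition rates (%) copied from Table 4, at SNR = 9, 18 and 30 dB.
table4 = {
    "BASELINE": [23.4, 46.7, 60.2],
    "CLEAN": [14.8, 27.3, 45.6],
    "REST-bias": [24.5, 50.2, 62.7],
    "REST-noise": [29.3, 48.4, 61.8],
    "REST": [33.0, 53.7, 65.5],
    "REST/LC": [35.2, 56.5, 66.7],
}

# Average each scheme over the three SNR conditions and rank from best to worst.
average = {name: sum(rates) / len(rates) for name, rates in table4.items()}
ranking = sorted(average, key=average.get, reverse=True)
print(" > ".join(ranking))
```

REST-noise and REST-bias swap places depending on the SNR (REST-noise is better at 9 dB, REST-bias at 18 and 30 dB), which is why the text lists them as "REST-noise or REST-bias".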

A final test was done to check whether the REST training scheme is operable for the clean-speech 
environment. It is worthwhile to note that some robust training algorithms, designed for improving the 
performance of speech recognizers under adverse-speech environment, do not perform well in the clean-speech 
environment. In the test, two sets of HMM models were generated, respectively, by the conventional ML 
training method and by the REST training scheme using the same clean-speech database. The base-syllable 
recognition rate was 76.05% for the ML method and 76.24% for the REST scheme. This result confirmed 
that the REST algorithm did not degrade the system performance when the training data were clean 
speech. 



5. Conclusions 

A robust training algorithm for generating a set of speech HMM models directly from a training 
database collected in adverse environment for adverse speech recognition has been discussed in this paper. Its 
main advantage lies in the incorporation of the signal bias-compensation and PMC noise-compensation 
operations of a given robust adverse speech recognition method into its iterative training process so as to 
make the resulting speech HMM models more suitable to be used in the given robust adverse speech 
recognition method. Its effectiveness in generating robust speech HMM models has been confirmed by 
simulations. Experimental results showed that the HMM models it generated were even better than the 
clean-speech HMM models for use in the given robust adverse speech recognition method when the PMC 
noise-compensation and/or channel bias-compensation operations are imperfect. So it is a promising robust 
training algorithm. 
Appendix A. The EM procedure for estimating $\{\Lambda_x, \{\Lambda_n^{(r)}, b^{(r)}\}_{r=1,\ldots,R}\}$ 

Eq. (2) can be solved using an iterative EM procedure (Dempster et al., 1977) which tries to find the locally 
optimal estimate of $\Theta = \{\Lambda_x, \{\Lambda_n^{(r)}, b^{(r)}\}_{r=1,\ldots,R}\}$ with the following two intermediate parameter sequences involved: 
the hidden state sequences $\{U^{(r)} = (u_1^{(r)}, \ldots, u_{T_r}^{(r)})\}_{r=1,\ldots,R}$ and the mixture component sequences 
$\{V^{(r)} = (v_1^{(r)}, \ldots, v_{T_r}^{(r)})\}_{r=1,\ldots,R}$. The first (expectation) step of the EM procedure is to compute the auxiliary 
Q-function defined as 

    $$Q(\Theta, \Theta_{k-1}) = E\left[ \log L\left(\{Z^{(r)}, U^{(r)}, V^{(r)}\}_{r=1,\ldots,R} \mid \Theta\right) \,\middle|\, \{Z^{(r)}\}_{r=1,\ldots,R}, \Theta_{k-1} \right].$$ (A.1) 

Here the subscript $k-1$ denotes the iteration index. In the second (maximization) step, new values of $\Theta_k$ are 
computed based on the maximization of $Q(\Theta, \Theta_{k-1})$: 

    $$\Theta_k = \arg\max_{\Theta} Q(\Theta, \Theta_{k-1}).$$ (A.2) 

The detailed derivation of the EM procedure is described as follows. 
Let 

    $$\Lambda^{(r)} = \begin{cases} G(\Lambda_x; \Lambda_n^{(r)}, b^{(r)}), & \text{for adverse-speech model}, \\ \Lambda_n^{(r)}, & \text{for non-speech model} \end{cases}$$ (A.3) 

be the environment-compensated HMM models, constructed from $\Lambda_x$ and $(\Lambda_n^{(r)}, b^{(r)})$, for the rth 
observation utterance $Z^{(r)}$. Here $G(\cdot)$ denotes a mapping function that transforms $\Lambda_x$ to match with the current 
environment of $Z^{(r)}$. By assuming that, in $\Lambda^{(r)}$, observations are mixture-Gaussian-distributed, we can 
calculate the mean vector $\mu_{jq}^{(r)}$ and covariance matrix $\Sigma_{jq}^{(r)}$ of the qth mixture component in the jth state of 
$\Lambda^{(r)}$, based on the assumed environment contamination model defined in Eqs. (1b) and (1c), by 

    $$\mu_{jq}^{(r)} = \begin{cases} (\mu_{x,jq} + b^{(r)}) \otimes \mu_n^{(r)}, & j \in \text{adverse-speech model}, \\ \mu_n^{(r)}, & j \in \text{non-speech model}, \end{cases}$$ (A.4a) 

    $$\Sigma_{jq}^{(r)} = \begin{cases} \Sigma_{x,jq} \otimes \Sigma_n^{(r)}, & j \in \text{adverse-speech model}, \\ \Sigma_n^{(r)}, & j \in \text{non-speech model}, \end{cases}$$ (A.4b) 

where $\otimes$ denotes the PMC noise-compensation operator (Gales and Young, 1993), and $\mu_{x,jq}$ and $\Sigma_{x,jq}$ are, 
respectively, the mean vector and covariance matrix of the qth mixture component in the jth state of $\Lambda_x$. By 
further assuming that the state-based Wiener filtering is the inverse operation of the PMC (Gales and 
Young, 1993; Vaseghi and Milner, 1997), we can express the compensated cepstral mean in Eq. (A.4a) 
by (Gales and Young, 1993; Vaseghi and Milner, 1997) 

    $$\mu_{jq}^{(r)} = \begin{cases} \mu_{x,jq} + b^{(r)} - h_j, & j \in \text{adverse-speech model}, \\ \mu_n^{(r)}, & j \in \text{non-speech model}, \end{cases}$$ (A.5) 

where $h_j$ is the vector of cepstral coefficients of the state-based Wiener filter of the jth state, which is constructed from 
an estimate of the signal power density spectrum at the jth state and an estimate of the noise power density 
spectrum of the rth utterance. 

Based on the above expression of $\Lambda^{(r)}$, the auxiliary Q-function can be rewritten as (Sankar and Lee, 
1996) 

    $$\begin{aligned}
    Q(\Theta, \Theta_{k-1}) &= Q\left((\Lambda_x, \{\Lambda_n^{(r)}, b^{(r)}\}_{r=1,\ldots,R}), \Theta_{k-1}\right) \\
    &= Q\left(\{\Lambda^{(r)}\}_{r=1,\ldots,R}, \Theta_{k-1}\right) \\
    &= Q_{k-1} + \sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \gamma_{t,k-1}^{(r)}(j,q) \log \Pr\left(z_t^{(r)} \mid u_t^{(r)} = j, v_t^{(r)} = q, \Theta\right) \\
    &= Q_{k-1} + \sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \gamma_{t,k-1}^{(r)}(j,q) \log N\left(z_t^{(r)}; \mu_{jq}^{(r)}, \Sigma_{jq}^{(r)}\right),
    \end{aligned}$$ (A.6) 

where 

    $$\gamma_{t,k-1}^{(r)}(j,q) = \Pr\left(Z^{(r)}, u_t^{(r)} = j, v_t^{(r)} = q \mid \Theta_{k-1}\right)$$ (A.7) 

is the probability of the observation $z_t^{(r)}$ produced from the qth mixture component of the jth state; $N_J$ and 
$N_Q$ denote, respectively, the total numbers of states and mixture components; $N(\cdot)$ represents the normal 
distribution; and $Q_{k-1}$ is a function depending only on the transition probability and mixture probability of 
$\Lambda_x$ (which are assumed to be the same as those of $\Lambda_{x,k-1}$). But, due to the fact that it is generally difficult 
to derive a closed-form solution for the above joint maximization problem, a multi-stage sequential 
maximization procedure is employed to approximate the local optimum of $\Theta_k$. In each stage, only one type of 
parameters is optimally estimated. 

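We first estimate the parameters of the noise model $\Lambda_{n,k}^{(r)}$ to maximize the Q-function in Eq. (A.6). They can 
be obtained by 

    $$\mu_{n,k}^{(r)} = \frac{\sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \bar{\gamma}_{t,k-1}^{(r)}(j,q)\, z_t^{(r)}}{\sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \bar{\gamma}_{t,k-1}^{(r)}(j,q)},$$ 

    $$\Sigma_{n,k}^{(r)} = \frac{\sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \bar{\gamma}_{t,k-1}^{(r)}(j,q) \left(z_t^{(r)} - \mu_{n,k}^{(r)}\right)\left(z_t^{(r)} - \mu_{n,k}^{(r)}\right)^{T}}{\sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \bar{\gamma}_{t,k-1}^{(r)}(j,q)},$$ (A.8) 

where $\bar{\gamma}_{t,k-1}^{(r)}(j,q) = \gamma_{t,k-1}^{(r)}(j,q)\, I(j \in \text{non-speech})$ and $I(\cdot)$ is the zero-one indicator function. 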

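We then estimate the signal bias $b_k^{(r)}$. After replacing $\Lambda_n^{(r)}$ and $\Lambda_x$ with $\Lambda_{n,k}^{(r)}$ and $\Lambda_{x,k-1}$, the Q-function 
becomes 

    $$\begin{aligned}
    Q&\left((\Lambda_{x,k-1}, \{\Lambda_{n,k}^{(r)}, b^{(r)}\}_{r=1,\ldots,R}), \Theta_{k-1}\right) \\
    &= Q_{k-1} + \sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \hat{\gamma}_{t,k-1}^{(r)}(j,q) \log N\left(z_t^{(r)}; \mu_{x,jq,k-1} + b^{(r)} - \hat{h}_j, \hat{\Sigma}_{jq}^{(r)}\right) \\
    &= Q_{k-1} + \sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \hat{\gamma}_{t,k-1}^{(r)}(j,q) \log N\left(y_{t,j}^{(r)}; \mu_{x,jq,k-1} + b^{(r)}, \hat{\Sigma}_{jq}^{(r)}\right),
    \end{aligned}$$ (A.9) 

where $\hat{\gamma}_{t,k-1}^{(r)}(j,q) = \gamma_{t,k-1}^{(r)}(j,q)\, I(j \in \text{speech})$; $\hat{h}_j$ and $\hat{\Sigma}_{jq}^{(r)}$ are updated versions of $h_j$ and $\Sigma_{jq}^{(r)}$ with $\Lambda_n^{(r)}$ and 
$\Lambda_x$ being replaced with $\Lambda_{n,k}^{(r)}$ and $\Lambda_{x,k-1}$; and $y_{t,j}^{(r)} = z_t^{(r)} + \hat{h}_j$ is the Wiener-filtered version of $z_t^{(r)}$ at the jth state. By 
solving 

    $$\frac{\partial Q\left((\Lambda_{x,k-1}, \{\Lambda_{n,k}^{(r)}, b^{(r)}\}_{r=1,\ldots,R}), \Theta_{k-1}\right)}{\partial b^{(r)}} = 0,$$ (A.10) 

the pth element of $b_k^{(r)}$ can be obtained by (Sankar and Lee, 1996) 

    $$b_k^{(r)}(p) = \frac{\sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \hat{\gamma}_{t,k-1}^{(r)}(j,q) \left(\hat{\Sigma}_{jq}^{(r)}(p,p)\right)^{-1} \left(y_{t,j}^{(r)}(p) - \mu_{x,jq,k-1}(p)\right)}{\sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \hat{\gamma}_{t,k-1}^{(r)}(j,q) \left(\hat{\Sigma}_{jq}^{(r)}(p,p)\right)^{-1}},$$ (A.11) 

where $y_{t,j}^{(r)}(p)$ and $\mu_{x,jq,k-1}(p)$ denote, respectively, the pth elements of $y_{t,j}^{(r)}$ and $\mu_{x,jq,k-1}$, and $\hat{\Sigma}_{jq}^{(r)}(p,p)$ is 
the (p,p)th element of $\hat{\Sigma}_{jq}^{(r)}$. 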

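We then estimate $\Lambda_{x,k}$. After replacing $\Lambda_n^{(r)}$ and $b^{(r)}$ with $\Lambda_{n,k}^{(r)}$ and $b_k^{(r)}$, the Q-function becomes 

    $$Q\left((\Lambda_x, \{\Lambda_{n,k}^{(r)}, b_k^{(r)}\}_{r=1,\ldots,R}), \Theta_{k-1}\right) = Q_{k-1} + \sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{j=1}^{N_J} \sum_{q=1}^{N_Q} \hat{\gamma}_{t,k-1}^{(r)}(j,q) \log N\left(\hat{x}_{t,j}^{(r)}; \mu_{x,jq}, \Sigma_{x,jq}\right),$$ (A.12) 

where 

    $$\hat{x}_{t,j}^{(r)} = z_t^{(r)} + \hat{h}_j - b_k^{(r)}$$ (A.13) 

is the signal bias-removed and Wiener-filtered signal of the jth state. The Q-function is now in the same 
form as that in the conventional EM algorithm for estimating the HMM parameters. So, the mean and 
covariance of $\Lambda_{x,k}$ can be estimated in the same way by 

    $$\mu_{x,jq,k} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \hat{\gamma}_{t,k-1}^{(r)}(j,q)\, \hat{x}_{t,j}^{(r)}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \hat{\gamma}_{t,k-1}^{(r)}(j,q)},$$ (A.14a) 

    $$\Sigma_{x,jq,k} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \hat{\gamma}_{t,k-1}^{(r)}(j,q) \left(\hat{x}_{t,j}^{(r)} - \mu_{x,jq,k}\right)\left(\hat{x}_{t,j}^{(r)} - \mu_{x,jq,k}\right)^{T}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \hat{\gamma}_{t,k-1}^{(r)}(j,q)}.$$ (A.14b) 

The HMM state transition probabilities and the mixture component coefficients can also be estimated by 
the standard EM method. 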
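
The occupancy-weighted re-estimation of a Gaussian mean and variance used in this last stage can be sketched in Python as below. This is a simplified illustration for scalar features and a single (state, mixture) pair; the function name `reestimate_gaussian`, the toy frames and the toy weights are hypothetical. The `frames` play the role of the bias-removed, Wiener-filtered observations and the `weights` play the role of the occupation probabilities.

```python
def reestimate_gaussian(frames, weights):
    """Occupancy-weighted mean and variance of scalar frames
    for one (state, mixture) pair."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, frames)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, frames)) / total
    return mean, var


# Toy data: the last frame has zero occupancy and is therefore ignored.
toy_mean, toy_var = reestimate_gaussian([1.0, 2.0, 3.0, 10.0], [1.0, 1.0, 1.0, 0.0])
print(toy_mean, toy_var)
```

With hard (0/1) weights, as produced by a Viterbi segmentation, the same function reduces to the plain sample mean and variance of the frames assigned to that state and mixture.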

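It can be verified that the Q-function will increase at each stage of the sequential maximization 
procedure, i.e., 

    $$\begin{aligned}
    Q(\Theta_{k-1}, \Theta_{k-1}) &= Q\left((\Lambda_{x,k-1}, \{\Lambda_{n,k-1}^{(r)}, b_{k-1}^{(r)}\}_{r=1,\ldots,R}), \Theta_{k-1}\right) \\
    &\le Q\left((\Lambda_{x,k}, \{\Lambda_{n,k}^{(r)}, b_k^{(r)}\}_{r=1,\ldots,R}), \Theta_{k-1}\right) \\
    &= Q(\Theta_k, \Theta_{k-1}).
    \end{aligned}$$ (A.15) 

This in turn leads to an increase in the likelihood of the training data in each iteration (Dempster et al., 
1977), i.e., 

    $$L\left(\{Z^{(r)}\}_{r=1,\ldots,R} \mid \Theta_k\right) \ge L\left(\{Z^{(r)}\}_{r=1,\ldots,R} \mid \Theta_{k-1}\right).$$ (A.16) 

Hence, the EM procedure is guaranteed to converge. 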

In practical implementation, the above EM procedure needs to be modified by invoking the 
segmental k-means algorithm (Juang and Rabiner, 1990) in order to increase its computational efficiency. It 
adds an additional pre-segmentation stage into the above iterative re-estimation procedure. In each 
iteration, all training utterances are first optimally segmented by the Viterbi algorithm (Forney, 1973) to 
determine the best state sequences $\{\hat{u}_t^{(r)}\}_{r=1,\ldots,R}$ and the best mixture component sequences $\{\hat{v}_t^{(r)}\}_{r=1,\ldots,R}$. Then, 
parameters of all models are re-estimated based on the given $\{\hat{u}_t^{(r)}, \hat{v}_t^{(r)}\}_{r=1,\ldots,R}$. All formulations of the 
above EM procedure listed in Eqs. (A.3)-(A.14) still hold except that $\bar{\gamma}_{t,k-1}^{(r)}(j,q)$ and $\hat{\gamma}_{t,k-1}^{(r)}(j,q)$ are now 
associated only with $\{\hat{u}_t^{(r)}, \hat{v}_t^{(r)}\}$ and hence all summations over j and q in Eqs. (A.6), (A.8), (A.9), (A.11), (A.12) and 
(A.14) have to be taken away. 

A final modification of the above re-estimation procedure is needed to replace the optimal signal bias 
estimation with the conventional SBR method. By making a simplified assumption of $\hat{\Sigma}_{jq}^{(r)} = I$, the 
modified version of Eq. (A.11) can be reduced to Eq. (8a). This completes the derivations of the REST 
training algorithm. 
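
Under the identity-covariance simplification just described, the bias estimate becomes a weighted average of the differences between the Wiener-filtered frames and the model means they are aligned to. The Python sketch below illustrates this special case only; the function name `estimate_bias`, the data layout (one aligned mean vector per frame) and the toy values are assumptions made for illustration.

```python
def estimate_bias(filtered_frames, aligned_means, weights):
    """Per-dimension weighted average of (y_t - mu_t), i.e. the bias estimate
    when every covariance matrix is replaced by the identity matrix."""
    dim = len(filtered_frames[0])
    total = sum(weights)
    return [
        sum(w * (y[p] - m[p])
            for w, y, m in zip(weights, filtered_frames, aligned_means)) / total
        for p in range(dim)
    ]


# Toy check: frames are built as "aligned mean + known bias".
true_bias = [0.5, -0.25]
aligned_means = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
filtered_frames = [[m[0] + true_bias[0], m[1] + true_bias[1]] for m in aligned_means]
bias = estimate_bias(filtered_frames, aligned_means, weights=[1.0, 1.0, 1.0])
print(bias)
```

In the segmental k-means variant the weights are 1 for speech frames along the Viterbi path and 0 otherwise, so the estimate is simply the average residual over the speech frames of the utterance.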
ABSTRACT 

In this paper, we propose a new training algorithm, segmental 
GPD training, for hidden Markov model (HMM) based speech 
recognizer using Viterbi decoding. This algorithm is based on 
the principle of minimum recognition error rate in which 
segmentation and discriminative training are jointly optimized. 
Various issues related to the special structure of HMMs in 
segmental GPD training are studied. We tested this algorithm on 
two speaker-independent recognition tasks. Our first experiment 
involves English E-set. Segmental GPD training was directly 
applied to HMMs generated from non-optimal uniform 
segmentation. A recognition rate of 88.7% was achieved on English E-set 
with whole word HMMs. Our second experiment involves 
connected digits TI-database. Segmental GPD training was applied 
to HMMs which were already trained using conventional 
training methods. A string recognition rate of 98.8% was achieved 
on 10-state whole word based HMMs through segmental GPD 
training. 

1. INTRODUCTION 

The use of hidden Markov models (HMMs) with Viterbi 
decoding has become a prevalent approach in speech 
recognition, because of its simple algorithmic structure and its 
clear superiority over other alternative recognition schemes. 
But, in spite of its proved high performance for many 
recognition tasks, the conventionally trained HMMs are based 
mainly on the principle of statistical data fitting in terms 
of increasing the HMM likelihood. The optimality of this 
training criterion is conditioned on the availability of an 
infinite amount of training data and the correct choice of the 
model. However, in reality, neither of these conditions is 
satisfied. The available training data is always limited, and 
the assumptions made by HMMs on the speech production 
process are often inaccurate. As a consequence, likelihood 
based training is not very effective for highly 
discriminative recognition applications and cannot guarantee optimal 
performance (i.e. minimum recognition error probability). 
This deficiency in the traditional training methods, namely, 
the lack of a direct relation with the recognition error rate, 
motivates the recent effort of discriminative training [1-6]. 

Despite the algorithmic beauty of Viterbi decoding, its 
application to HMMs imposes several stringent constraints 
on training algorithms. A training algorithm for HMMs 
with Viterbi decoding must cope with the nature of the 
likelihood score on the optimal path, which has an intricate 
relation with HMM parameters. 

Recently, discriminative training based on the 
"generalized probabilistic descent" (GPD) method has proved to be 
successful in many applications. In this paper, we propose 
a segmental based training method, segmental GPD 
training, for speech recognizer using hidden Markov model and 
Viterbi decoding. The main features of our approach can 
be summarized as follows: 

(1) The algorithm is based on the principle of minimum 
recognition error rate in which segmentation and 
discriminative training are jointly optimized. 

(2) The algorithm can be initialized from a given HMM, 
regardless of whether it has been trained according to other 
criteria or directly generated from a training set with 
(non-optimal) uniform segmentation. 

(3) The algorithm handles both errors and correct 
recognition cases in a theoretically consistent way, and is 
adaptively adjusted to achieve an optimal configuration 
with maximum possible separation between each confusing 
classes. 

(4) The algorithm can be used either off-line or on-line 
with the ability of learning new features from any new 
training sources. 

(5) The algorithm is consistent with the HMM framework 
and does not require major modification of the current 
system. Moreover, it is theoretically justified to converge to 
a (at least locally) minimum point of the recognition error 
rate. 

2. THE SYSTEM CONFIGURATION 

In an HMM based recognizer, the continuous speech 
waveform is first blocked into frames and a discrete sequence 
of feature vectors, $X = \{x_0, x_1, \ldots, x_{T(X)}\}$, is extracted, 
where $T(X)$ corresponds to the total number of frames in the 
speech signal. We will identify the input speech utterance as 
its feature vector sequence $X = (x_0, x_1, \ldots, x_{T(X)})$ without 
confusion. 

In HMM, the observation probability density function of 
observing vector x in the j-th state of i-th word HMM is 

    $$b_j^i(x) = \sum_{k=1}^{K} c_{j,k}^i N(x, \mu_{j,k}^i, \sigma_{j,k}^i),$$ (1) 

which is a mixture of Gaussian distributions, where $c_{j,k}^i$ is 
the mixture weights and satisfies 

    $$\sum_{k=1}^{K} c_{j,k}^i = 1.$$ (2) 

The optimal path under the Viterbi decoding is the one 
which attains the highest log-likelihood score. We denote 
$\bar{\Theta}^i$ to be the optimal path of the input utterance X in 
i-th word HMM $\lambda_i$. Then, the log-likelihood score of the 
input utterance X along its optimal path in i-th model $\lambda_i$, 
$g_i(X, \lambda_i)$, can be written as 

    $$g_i(X, \lambda_i) = \log f(X, \bar{\Theta}^i \mid \lambda_i)$$ (3) 

    $$= \log b_{\bar{\theta}_0^i}^i(x_0) + \sum_{t=1}^{T(X)} \log a_{\bar{\theta}_{t-1}^i \bar{\theta}_t^i}^i + \sum_{t=1}^{T(X)} \log b_{\bar{\theta}_t^i}^i(x_t),$$ (4) 

where $\bar{\theta}_t^i$ is the corresponding state sequence along the 
optimal path $\bar{\Theta}^i$, $x_t$ is the corresponding observation vector 
at time t, $T(X)$ is the number of frames in the input 
utterance X, and $a_{\bar{\theta}_{t-1}^i \bar{\theta}_t^i}^i$ is the state transition probability from 
state $\bar{\theta}_{t-1}^i$ to state $\bar{\theta}_t^i$. 
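
The optimal-path score described above can be illustrated with a small log-domain Viterbi sketch in Python. It is a generic textbook implementation, not the authors' code: the function name `viterbi_score`, the use of an explicit initial-state log-probability vector `log_init` (set here so that the path must start in state 0) and the toy two-state model are all assumptions.

```python
import math

NEG_INF = float("-inf")


def viterbi_score(log_init, log_trans, log_emis):
    """Best-path log-likelihood and state path.

    log_init[s]      : log-probability of starting in state s
    log_trans[p][s]  : log transition probability from state p to state s
    log_emis[t][s]   : log observation density of frame t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emis)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emis[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Toy left-to-right model with two states and three frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emis = [[math.log(0.9), math.log(0.1)],
            [math.log(0.2), math.log(0.8)],
            [math.log(0.1), math.log(0.9)]]
best_score, best_path = viterbi_score(log_init, log_trans, log_emis)
print(best_score, best_path)
```

The returned score is the sum of the log observation densities and log transition probabilities along the best state sequence, which is the quantity used as the discriminant score of a word model.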

The recognizer classifies the input utterance to i-th word 
if and only if $i = \arg\max_j g_j(X, \lambda_j)$. If we define the 
classification error count function for i-th class as 

    $$e_i(X, \Lambda) = \begin{cases} 1, & i \neq \arg\max_j g_j(X, \lambda_j), \\ 0, & \text{otherwise}, \end{cases}$$ (5) 

then the goal of training is to reduce the expected error rate 

    $$L(\Lambda) = E_X\left(e(X, \Lambda)\right).$$ (6) 

In practice, the training result is often measured by the 
empirical error rate 

    [Eq. (7), a double sum over the classes and over the training samples of each class, is illegible in the source.] 

However, direct minimization of the empirical error rate 
function has several serious deficiencies. It is numerically 
difficult, because the classification error count function is not 
a continuous function. The inability of the empirical 
error rate function to distinguish near miss and barely 
correct cases may impair the performance of the recognizer on 
the independent test data set. Viterbi decoding also adds 
one more complexity here, because under the Viterbi 
decoding, the form and the value of the empirical error rate 
function vary with the segmentation determined by the HMM 
parameters. A set of numerically optimal HMM 
parameters based on the current segmentation does not maintain 
its optimality under a different segmentation, unless a 
convergence result can be proved. 



3. SEGMENTAL GPD TRAINING OF HMMS

Our approach to this problem is to embed both the classification error count function and the decision rule into a smooth functional form, the loss function. For each class, we introduce a misclassification measure, which provides distance information concerning the correct class and all other competing classes. These misclassification measures are included in the loss function. In segmental GPD training, the loss function is constructed through the following steps:

(1) Let $g_j(X,\lambda_j)$ be the log-likelihood score of the input utterance in the j-th word model. Define the misclassification measure for each class i by

$$d_i(X,\Lambda) = -g_i(X,\lambda_i) + \log\left[\frac{1}{W-1}\sum_{j,\,j\neq i} e^{g_j(X,\lambda_j)\eta}\right]^{1/\eta}, \qquad (8)$$

where $\eta$ is a positive number and W is the total number of classes. If $\mu(j)$ represents a discrete measure defined on the finite integer set $\{j \mid j \neq i \text{ and } 1 \le j \le W\}$ with equal mass $\frac{1}{W-1}$ on each integer j, then

$$\left[\frac{1}{W-1}\sum_{j,\,j\neq i} e^{g_j(X,\lambda_j)\eta}\right]^{1/\eta} = \left(\int e^{g_j(X,\lambda_j)\eta}\,d\mu(j)\right)^{1/\eta} = \left\| e^{g_j(X,\lambda_j)} \right\|_{\eta} \qquad (9)$$

is an $L^{\eta}$ norm approximation to $\max_{j,\,j\neq i} e^{g_j(X,\lambda_j)}$ as $\eta \to \infty$. A misclassification measure $d_i(X,\Lambda) > 0$ indicates that a misclassification has been observed, which means that $g_i(X,\lambda_i)$ is significantly smaller than $\max_{j,\,j\neq i} g_j(X,\lambda_j)$. Moreover, the sign and absolute value of the misclassification measure $d_i(X,\Lambda)$ identify the near miss and barely correct cases.

(2) Define the smoothed loss function for each class i by



(10) 



(3) Define the loss function for the entire training population by

$$l(X,\Lambda) = \sum_{k=1}^{W} l_k(X,\Lambda)\,1(X \in C_k). \qquad (11)$$
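The three construction steps above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and because the body of eq. (10) is not legible in the text, the sigmoid used for the smoothed loss is an assumption (a common choice in the GPD literature), not a form confirmed here.

```python
import math

def misclassification_measure(scores, i, eta):
    """Eq. (8): d_i = -g_i + (1/eta) * log(mean over j != i of exp(eta * g_j))."""
    competitors = [g for j, g in enumerate(scores) if j != i]
    shift = max(competitors)  # log-sum-exp shift for numerical stability
    mean_exp = sum(math.exp(eta * (g - shift)) for g in competitors) / len(competitors)
    return -scores[i] + shift + math.log(mean_exp) / eta

def smoothed_loss(d, gamma):
    """ASSUMED form of eq. (10): a sigmoid of the misclassification measure."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

def total_loss(scores, correct, eta, gamma):
    """Eq. (11): the indicator keeps only the term of the correct class."""
    return smoothed_loss(misclassification_measure(scores, correct, eta), gamma)
```

With a large `eta` the measure approaches the gap between the best competing score and the correct score, so its sign separates correct from incorrect decisions while its magnitude reflects how close the decision was.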

By controlling the parameters $\eta$ and $\gamma$, we can have an accurate smoothed approximation to the classification error count function. Therefore, minimisation of the expected loss of this specially designed loss function is directly linked to the minimisation of the error probability. Note that the misclassification measure of (8) takes into account all the competing classes. This makes an efficient multi-class adjustment training algorithm possible. The generalized probabilistic descent (GPD) algorithm adjusts the model parameters $\Lambda$ recursively according to

$$\Lambda_{n+1} = \Lambda_n - \epsilon_n U_n \nabla l(X_n, \Lambda_n), \qquad (12)$$

where $U_n$ is a properly designed positive definite matrix, $\{\epsilon_n : n \ge 1\}$ is a sequence of positive numbers, and $\nabla l(X_n,\Lambda_n)$ is the gradient vector of the loss function $l(X,\Lambda)$ at the n-th training sample $X_n$. This algorithm is proved to converge, provided that $\sum_{n=1}^{\infty} \epsilon_n = \infty$ and $\sum_{n=1}^{\infty} \epsilon_n^2 < \infty$ (see [8] for a detailed discussion). Our emphasis on the general form of the GPD algorithm given in (12) is intentional. In segmental GPD training, a properly designed positive definite matrix sequence $U_n$ in (12) is not only instrumental but crucial, as will be discussed in the next section.
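As a rough illustration of the recursion in (12), the sketch below applies the update with a diagonal $U_n$ and the step size $\epsilon_n = \epsilon_0 / n$, which satisfies both convergence conditions. The names `gpd_step` and `minimise`, the value of `eps0`, and the use of a deterministic gradient function in place of per-sample gradients are all assumptions made for the example.

```python
def gpd_step(params, grad, n, eps0=0.5, u_diag=None):
    """One update of eq. (12) with a diagonal U_n and eps_n = eps0 / n."""
    if u_diag is None:
        u_diag = [1.0] * len(params)  # identity matrix when no scaling is supplied
    eps_n = eps0 / n
    return [p - eps_n * u * g for p, u, g in zip(params, u_diag, grad)]

def minimise(grad_fn, start, steps):
    """Run the recursion for a fixed number of steps (n starts at 1)."""
    params = list(start)
    for n in range(1, steps + 1):
        params = gpd_step(params, grad_fn(params), n)
    return params
```

For instance, `minimise(lambda p: [p[0] - 3.0], [0.0], 1000)` moves the single parameter towards the minimiser of $(\theta - 3)^2 / 2$; the decaying step size makes the approach slow but steady.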

4. PARAMETER TRANSFORMATIONS
In segmental GPD training, the HMM parameters are adaptively adjusted according to (12). A diagram of this training procedure is illustrated in Fig. 1. However, due to the special structure of HMMs, the parameters of HMMs must satisfy certain constraints. These constraints are not guaranteed to be satisfied by the GPD algorithm. Thus, some treatments are necessary.

Instead of using a complicated constrained GPD algorithm, we apply segmental GPD training on transformed HMM parameters. These transformations have the purpose of maintaining all constraints on the HMM parameters during the process of segmental GPD training. The following transformations are used in our approach:

(1) Logarithm of the variance

where $\sigma^2_{i,j,k,d}$ is the variance of the i-th word, j-th state, k-th mixture and d-th dimension.

(2) Transformed logarithm of the mixture weights

where L is the total number of the mixture weights in the j-th state in the i-th word model.

(3) Transformed logarithm of the transition probability

where M is the total number of states in the i-th word model.
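The formulas for these three transformations are not legible in the text, so the following sketch only shows one common way such constraints are maintained: an exponential map for variances and a softmax map for weights and transition probabilities. Both maps, and the function names, are assumptions rather than the forms used in the paper.

```python
import math

def variance_from_log(log_var):
    """Any real-valued free parameter maps to a strictly positive variance."""
    return math.exp(log_var)

def probabilities_from_free(free_params):
    """Softmax: any real vector maps to positive values that sum to one.

    ASSUMED form for mixture weights (length L) and for one row of the
    transition matrix (length M).
    """
    shift = max(free_params)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in free_params]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the unconstrained parameters can take any real value, an unconstrained update such as (12) can be applied to them directly and the constraints hold after every step.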

A critical step of segmental GPD training lies in how to handle the problem of small variance. Variances in HMMs can differ by as much as 10* to 10* times. Using a constant step size $\epsilon_n$ for all HMM parameters will not produce the desired result, because the sensitivity of the mean parameter adjustment is determined by the size of the variance. The same step size can be too large for some mean parameters and too small for others. For a moderately complex HMM recognition system, there are 10* to 10^ parameters to be adjusted simultaneously at each iteration. Without a theoretical guide, it is almost impossible to observe the desired performance improvement within finite steps.

In order to compensate for this vast difference in sensitivity, a carefully designed positive definite matrix $U_n$ is crucial. The positive definite matrix $U_n$ used in our approach is a diagonal matrix

$$\mathrm{diag}(\sigma_1^2(n), \ldots, \sigma_D^2(n))$$

for each state, where $\sigma^2(n)$ is the variance of the HMM at time n. This corresponds to GPD training on the normalised mean parameter $\mu/\sigma$, which takes care of the sensitivity issue.
5. EXPERIMENTAL EVALUATION
The proposed segmental GPD algorithm is evaluated on two speaker independent tasks. In both experiments, whole word based HMMs were used. Each HMM is a left-to-right HMM with Gaussian mixture state observation densities. The covariance matrix in each HMM is a diagonal matrix. The feature vectors used in the experiments consist of 38 elements, with 12 cepstrum coefficients, 12 delta-cepstrum coefficients, 12 delta-delta cepstrum coefficients, the delta log energy and delta-delta energy [9].

Our first experiment involves the English E-set (b,c,d,e,g,p,t,v,z). The speech signal was recorded from 100 native American speakers, 50 male and 50 female, through local dialed-up telephone lines. Every talker spoke each word twice to produce two data sets. One was used for training and the other for testing. We started with untrained 10-state, 5-mixture, left-to-right, whole word based HMMs, directly generated from the non-optimal uniform segmentation. The recogniser has a recognition rate of 76% on the testing set (89% on the training set). In 10 iterations of segmental GPD training, the recogniser achieved a recognition rate of 88.3% on the testing set (99.6% on the training set). We also tested segmental GPD training on 15-state, 3-mixture, left-to-right whole word based HMMs generated from uniform segmentation. The recognition rate of the recognizer before segmental GPD training is 73.3% on the testing set (86.3% on the training set). The recognizer achieved a recognition rate of 88.7% on the testing set (100% on the training set). In both cases, a 50% error rate reduction was achieved by segmental GPD training. These results are the best results reported so far on this data set using whole word based HMMs. Fig. 2 illustrates the performance improvement during the training process of 15-state, 3-mixture HMMs.
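The quoted error rate reduction can be reproduced from the test-set recognition rates given above. The helper below is a hypothetical name introduced only for this arithmetic check.

```python
def relative_error_reduction(rate_before, rate_after):
    """Relative drop in error rate, given recognition rates in percent."""
    error_before = 100.0 - rate_before
    error_after = 100.0 - rate_after
    return (error_before - error_after) / error_before

# 10-state, 5-mixture models: 76% -> 88.3% on the testing set
reduction_10_state = relative_error_reduction(76.0, 88.3)
# 15-state, 3-mixture models: 73.3% -> 88.7% on the testing set
reduction_15_state = relative_error_reduction(73.3, 88.7)
```

Both values come out slightly above one half (roughly 0.51 and 0.58), consistent with the reduction of about 50% stated in the text.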

Our second experiment is related to the speaker independent TI-database of connected digit utterances. The digit strings of the TI-database have a random length from 1 to 7. The speech signal was recorded from various regions of the United States. It contains 8565 strings for training and 8578 strings for testing. The model we used was a 10-state, 64-mixture, whole word based HMM model which was trained by the conventional method and was shown to achieve top performance. Our segmental GPD training was based on the word level error, in which the word boundary information was obtained by running a path on all training utterances. Then, segmental GPD training was applied on the segmented word utterances. After three iterations, a string error rate reduction of 8% on this well trained HMM model was obtained. The string recognition rate on the testing data set is 98.8%. As illustrated in Table 1, the improvement was mainly due to the reduction of the substitution errors, which is exactly the effect of the word level error based training.

                        Original    Segmental GPD
Substitution Error         60            53
Total String Error        113           104
Recognition Rate         98.7%         98.8%

Table 1: Recognition result of TI-database

6. SUMMARY AND DISCUSSION
In this paper, we have proposed a new training method, segmental GPD training, for HMM based recognisers using Viterbi decoding. We investigated its performance on two speaker independent recognition tasks using HMMs directly generated from non-optimal uniform segmentation and HMMs trained by conventional methods. Segmental GPD training is based on the principle of minimum recognition error rate with a theoretically justified convergence property. In our approach, both the classification error count and the decision rule are embedded into a smooth functional form, and segmentation and discriminative training are jointly optimised for the goal of minimum recognition error rate. Various issues related to the special structure of HMMs in segmental GPD training are studied. We demonstrated the effectiveness of the proposed training algorithm in isolated word and connected digit recognition applications. Further research and experiments on sub-word based systems are in progress.

Abstract. This paper describes a method of adapting a continuous density HMM recogniser trained on clean cepstral speech data to make it robust to noise. The technique is based on parallel model combination (PMC) in which the parameters of corresponding pairs of speech and noise states are combined to yield a set of compensated parameters. It improves on earlier cepstral mean compensation methods in that it also adapts the variances and as a result can deal with much lower SNRs. The PMC method is evaluated on the NOISEX-92 noise database and shown to work well down to 0 dB SNR and below for both stationary and non-stationary noises. Furthermore, for relatively constant noise conditions, there is no additional computational cost at run-time.




Keywords. Speech recognition; noise compensation; HMM; PMC.

1. Introduction

As speech recognition technology moves from the laboratory to real applications, there is a growing need to make systems which are robust to a wide range of background noises. Many different methods have been studied for achieving noise robustness (Juang, 1991; Furui, 1992), most of which can be classified into one of two major approaches.

Firstly, the corrupted speech input signal can be preprocessed prior to the pattern matching stage in an attempt to enhance the signal-to-noise ratio (SNR). The methods used in this approach include spectral subtraction (Boll, 1979; Van Compernolle, 1989; Lockwood and Boudy, 1992) and spectral mapping (Sorensen, 1991; Cung and Normandin, 1992). The main difficulty with this approach is that it must rely solely on exploiting knowledge about the interfering noise since there can be no a priori knowledge of what will be said.

The second class of methods attempts to modify the pattern matching stage itself in order to account for the effects of noise. Methods in this approach include noise masking (Klatt, 1979; Holmes and Sedgewick, 1986; Mellor and Varga, 1992), the use of robust distance measures (Mansour and Juang, 1988; Carlson and Clements, 1991), state-based filtering (Beattie and Young, 1991), cepstral mean compensation (Chen, 1987; Berstein and Shallom, 1991; Beattie and Young, 1992) and HMM decomposition (Moore, 1986; Varga and Ponting, 1989; Kadirkamanathan, 1992).

This paper is concerned with the latter approach to noise robustness. In particular, a scheme based on parallel model combination (PMC) will be described (Gales and Young, 1992). PMC is based on the assumption that knowledge of both the noise and the speech should be exploited to gain maximal effect. This implies that noise compensation should take place in the pattern matching stage where knowledge of the speech to be recognised is embedded in the stored patterns. In the case of an HMM recogniser, this implies that the compensation must be state-based to allow stationarity of the speech component to be assumed.

The PMC approach is closely related to the HMM decomposition approach referenced above. There are, however, two important differences. Firstly, HMM decomposition operates in the log filter-bank domain rather than in the preferred cepstral domain. It therefore lacks the advantages of the cepstral transform in terms of parameter decorrelation and compactness. Furthermore, it requires the state variances to be diagonal, compounding the problem of correlation between the filter-bank channels. Secondly, it carries a high computational cost since the output probabilities have to be calculated from both the noise and speech distributions at run-time. PMC, on the other hand, works directly in the cepstral domain and, depending on the variability of noise sources, the additional run-time overhead can be as low as zero.

The remainder of this paper is organised as follows. In the next section, the basic theory of PMC is outlined and its relationship to an existing method of cepstral mean compensation is discussed. Section 3 then considers the practical issues of covariance approximation and multiple state noise modelling. Section 4 describes an evaluation of the PMC method on the NOISEX noise database (Varga et al., 1992) and finally, Section 5 presents some conclusions.



2. Parallel model combination

2.1. Basic theory

PMC assumes that the speech to be recognised is modelled by a set of continuous density HMMs which have been trained using clean speech data. Similarly, the interfering background noise is also modelled by a continuous density HMM which will initially be assumed to consist of a single state. All signals are represented by Mel-Frequency Cepstral Coefficients (MFCCs).

Given a clean speech HMM with state output distributions characterised by means and variances $\{\mu_i, \Sigma_i\}$, the noise mean and covariance $\{\tilde\mu, \tilde\Sigma\}$ is combined with each state i in turn to calculate a set of compensated distribution parameters $\{\hat\mu_i, \hat\Sigma_i\}$. The basis of the calculation is the assumption that the speech and noise are additive in the linear power domain. Noisy speech can therefore be regarded as being generated by the clean speech HMM operating in parallel with the noise HMM. These models can be combined to give a compensated noisy speech HMM by mapping the distribution parameters back into the linear spectral domain, finding the parameters of the sum of the two distributions and then mapping back to the MFCC domain. This Parallel Model Combination process is summarised in Figure 1.

Notice that since there is only one noise state there is no ambiguity as to which noise state should be combined with each speech model state. Modelling non-stationary noise will, however, require multiple state noise HMMs and this is discussed further in the next section.

The details of the mapping procedure are as follows. Let $\mu^c$ and $\Sigma^c$ be the mean and covariance of any state output distribution in the cepstral domain. Cepstral parameters are derived from the log spectrum by the discrete cosine transform, which can be represented by a matrix C. Since this transform is linear, the corresponding distribution parameters $\mu^l$ and $\Sigma^l$ in the log spectral domain are given straightforwardly by

$$\mu^l = C^{-1}\mu^c, \qquad (1)$$

$$\Sigma^l = C^{-1}\Sigma^c (C^{-1})^T. \qquad (2)$$
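The linear mapping from the cepstral to the log spectral domain can be sketched as below. The sketch assumes a square, orthonormal DCT-II matrix so that the inverse of C is simply its transpose; a practical MFCC front end usually truncates the cepstrum and may scale C differently, so this is an illustrative simplification, and all function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n), so that C^{-1} equals C transposed."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (2 * m + 1) / (2 * n))
                     for m in range(n)])
    return rows

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def cepstral_to_log_spectral(mean_c, cov_c, c):
    """mu^l = C^{-1} mu^c and Sigma^l = C^{-1} Sigma^c (C^{-1})^T."""
    c_inv = transpose(c)  # valid only because C is orthonormal in this sketch
    mean_l = matvec(c_inv, mean_c)
    cov_l = matmul(matmul(c_inv, cov_c), transpose(c_inv))
    return mean_l, cov_l
```

Applying `matvec(c, mean_l)` to the result recovers the original cepstral mean, which is the direction used later when the compensated parameters are mapped back to the cepstral domain.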



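If the distributions in the cepstral and log spectral domains are assumed to be Gaussian, then the distributions in the linear domain will be log normal. The i-th component of the mean $\mu$ in the linear domain is then given by

$$\mu_i = E[e^{x_i}] = \int_{\mathcal{R}^n} \frac{1}{(2\pi)^{n/2}|\Sigma^l|^{1/2}} \exp\left(x_i - \tfrac{1}{2}(x-\mu^l)^T(\Sigma^l)^{-1}(x-\mu^l)\right)dx, \qquad (3)$$

where $\mathcal{R}^n$ is the region of all possible acoustic observation vectors x in the log spectral domain. As shown in Appendix A this may be simplified to give

$$\mu_i = e^{\mu_i^l + \Sigma_{ii}^l/2}. \qquad (4)$$

Similarly, the variance in the linear domain can be calculated from the expectation

$$E[e^{x_i}e^{x_j}] = \int_{\mathcal{R}^n} \frac{1}{(2\pi)^{n/2}|\Sigma^l|^{1/2}} \exp\left(x_i + x_j - \tfrac{1}{2}(x-\mu^l)^T(\Sigma^l)^{-1}(x-\mu^l)\right)dx. \qquad (5)$$

In Appendix A this is shown to reduce to

$$E[e^{x_i}e^{x_j}] = \mu_i\mu_j e^{\Sigma_{ij}^l}. \qquad (6)$$

Hence,

$$\Sigma_{ij} = E[e^{x_i}e^{x_j}] - E[e^{x_i}]E[e^{x_j}] = \mu_i\mu_j\left[e^{\Sigma_{ij}^l} - 1\right]. \qquad (7)$$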

The above mapping is used to derive the distribution parameters in the linear spectral domain for each pair of speech and noise states. From the assumption that the speech and noise are independent and additive, the combined mean and covariance are given by

$$\hat\mu = g\mu + \tilde\mu, \qquad (8)$$

$$\hat\Sigma = g^2\Sigma + \tilde\Sigma, \qquad (9)$$

where $(\mu, \Sigma)$ are the speech model parameters and $(\tilde\mu, \tilde\Sigma)$ are the noise model parameters. The factor g is a gain matching term introduced to account for the fact that the level of the original clean speech training data may be different from that of the noisy speech.

Fig. 1. Parallel model combination.

If the combined distribution in the linear spectral domain is assumed to be approximately log normal, the above process can be straightforwardly inverted. Firstly, the linear domain parameters are mapped back to the log domain by

$$\hat\mu_i^l = \log(\hat\mu_i) - \tfrac{1}{2}\log\left(\frac{\hat\Sigma_{ii}}{\hat\mu_i^2} + 1\right), \qquad (10)$$

$$\hat\Sigma_{ij}^l = \log\left(\frac{\hat\Sigma_{ij}}{\hat\mu_i\hat\mu_j} + 1\right), \qquad (11)$$

and secondly, back to the cepstral domain by

$$\hat\mu^c = C\hat\mu^l, \qquad (12)$$

$$\hat\Sigma^c = C\hat\Sigma^l C^T. \qquad (13)$$
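The log spectral part of the procedure (log normal mapping, addition in the linear domain, inverse mapping) can be sketched as follows. The sketch works on log spectral parameters only and leaves out the cepstral transform; the function names are hypothetical, and the combination step assumes the additive form with gain g described in the text.

```python
import math

def log_to_linear(mean_l, cov_l):
    """Log-normal moments: mu_i = exp(mu^l_i + S^l_ii / 2),
    S_ij = mu_i * mu_j * (exp(S^l_ij) - 1)."""
    n = len(mean_l)
    mean = [math.exp(mean_l[i] + 0.5 * cov_l[i][i]) for i in range(n)]
    cov = [[mean[i] * mean[j] * (math.exp(cov_l[i][j]) - 1.0) for j in range(n)]
           for i in range(n)]
    return mean, cov

def linear_to_log(mean, cov):
    """Inverse mapping under the log-normal assumption (eqs. (10) and (11))."""
    n = len(mean)
    mean_l = [math.log(mean[i]) - 0.5 * math.log(cov[i][i] / mean[i] ** 2 + 1.0)
              for i in range(n)]
    cov_l = [[math.log(cov[i][j] / (mean[i] * mean[j]) + 1.0) for j in range(n)]
             for i in range(n)]
    return mean_l, cov_l

def pmc_combine(speech_l, noise_l, g=1.0):
    """Combine (mean, cov) pairs given in the log spectral domain."""
    s_mean, s_cov = log_to_linear(*speech_l)
    n_mean, n_cov = log_to_linear(*noise_l)
    dim = len(s_mean)
    mean = [g * s_mean[i] + n_mean[i] for i in range(dim)]
    cov = [[g * g * s_cov[i][j] + n_cov[i][j] for j in range(dim)] for i in range(dim)]
    return linear_to_log(mean, cov)
```

A quick sanity check is that `linear_to_log(*log_to_linear(m, c))` returns `(m, c)` again, and that combining with a noise model of negligible energy leaves the speech parameters essentially unchanged.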



To prepare an HMM recogniser for operation in a particular stationary noise environment, a one state noise model is trained on samples of the background, and the average noisy speech signal energy $E_{ns}$ and the average background noise energy $E_n$ are estimated. Using a gain matching term given by

$$g = \frac{E_{ns} - E_n}{E_s}, \qquad (14)$$

where $E_s$ is the average energy of the clean training speech, the noise state mean $\tilde\mu$ and covariance $\tilde\Sigma$ are used to calculate compensated output distribution parameters $\{\hat\mu_i, \hat\Sigma_i\}$ for every state i of every model. In a practical system, this compensation process would be repeated during idle periods so that slowly changing noise or signal levels could be tracked.



2.2. Relationship to Wiener filtering

The PMC method just described is related to existing state-based Wiener Filtering cepstral mean compensation methods (Berstein and Shallom, 1991; Beattie and Young, 1992). The basic assumption of the WF method is that the speech associated with a given HMM state is stationary with power spectrum $S(f)$. Hence, given knowledge of the noise power spectrum $N(f)$, a matched filter can be designed by

$$W(f) = \frac{gS(f)}{gS(f) + N(f)}, \qquad (15)$$

where g is a gain matching term as described above. This equation may be rewritten in terms of the noisy speech observations $\hat S(f)$ as

$$S(f) = W(f)\hat S(f), \qquad (16)$$

which in turn leads to the equivalent relation in terms of cepstral means,

$$\mu^c = w^c + \hat\mu^c, \qquad (17)$$

where $\mu$ is the cepstral mean of the clean estimate and therefore corresponds to the mean obtained when training on clean speech and $\hat\mu$ is the desired compensated cepstral mean which is matched to the noisy speech. The cepstral transform w of $W(f)$ can therefore be regarded as an estimate of the mean shift needed to transform the clean speech mean into a noisy speech mean.

The estimate of the noisy speech mean given by eq. (17) can be written equivalently in the log power domain as

$$\hat\mu^l = \mu^l - w^l. \qquad (18)$$

From eq. (15), assuming without loss of generality that g = 1 and replacing $S(f)$ by the clean mean $\mu$ and $N(f)$ by the noise mean $\tilde\mu$ gives in the log power domain

$$w^l = \log(\mu) - \log(\mu + \tilde\mu). \qquad (19)$$

Under the assumptions used in the PMC method, substituting eqs. (19) and (10) into eq. (18) gives

$$\hat\mu_i^l = \log(\mu_i + \tilde\mu_i) - \tfrac{1}{2}\log\left(\frac{\Sigma_{ii}}{\mu_i^2} + 1\right), \qquad (20)$$

whereas the PMC method gives

$$\hat\mu_i^l = \log(\mu_i + \tilde\mu_i) - \tfrac{1}{2}\log\left(\frac{\Sigma_{ii} + \tilde\Sigma_{ii}}{(\mu_i + \tilde\mu_i)^2} + 1\right). \qquad (21)$$

From this it can be seen that for the WF method the speech variance is taken account of implicitly, because the HMM is trained in the log domain. However, the noise is estimated in the linear domain and its variance is assumed to be zero. Thus, the PMC method should have advantages for noise with a significant variance. Furthermore, the PMC also yields compensated covariances and should therefore be more effective at very low SNRs.
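To make the comparison concrete, the sketch below evaluates the two log domain mean estimates for a single spectral channel, with g = 1. The inputs are linear domain means and variances; the function names and the example numbers are illustrative assumptions, not values from the paper.

```python
import math

def wf_log_mean(mu, var, noise_mu):
    """Wiener-filter style estimate: the noise variance does not appear."""
    return math.log(mu + noise_mu) - 0.5 * math.log(var / mu ** 2 + 1.0)

def pmc_log_mean(mu, var, noise_mu, noise_var):
    """PMC estimate: speech and noise variances both enter the correction term."""
    total = mu + noise_mu
    return math.log(total) - 0.5 * math.log((var + noise_var) / total ** 2 + 1.0)

# Hypothetical channel: speech mean 4, speech variance 8, noise mean 6, noise variance 20
wf_value = wf_log_mean(4.0, 8.0, 6.0)
pmc_value = pmc_log_mean(4.0, 8.0, 6.0, 20.0)
```

The two estimates share the leading `log(mu + noise_mu)` term and differ only in the variance correction, which is where the noise variance enters the PMC result.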



3. Practical implementation

3.1. Covariance approximation

The procedure described in the previous section assumes that full covariance matrices are used. In practice, it is common to assume that observation vector components are independent so that diagonal covariance matrices can be used. This reduces the amount of training data required and reduces the run-time computational requirements. The PMC compensation scheme, however, yields full covariance matrices even when the initial clean speech and noise distributions are both diagonal. Furthermore, the independence assumption for some noise sources is not justified and hence it is beneficial in some cases to use a full covariance matrix for the noise model.

In order to avoid the run-time penalty of using full covariance matrices in the compensated models, one of two simple approximations can be used. Firstly, the full covariance matrices can be made diagonal by simply setting the off-diagonal terms to zero. As shown below, this has little effect on performance.

Secondly, a so-called fixed variance can be used (Paul, 1987) whereby the diagonal variance of the entire clean speech data is used for all state variances (and never re-estimated). However, when used for continuous speech, the fixed variance should be scaled to match the determinant of the noise covariance so that the normalising constants in the inter-word noise model probability distribution and the within-word speech model probability distributions are equalised. This is not, of course, necessary when the fixed variance is used with the noise model as well. However, this simplification is unnecessary since there is usually sufficient data to reliably estimate the noise variance. Fixed variance works well for moderate SNRs and has the advantage that it can be implemented as a global scaling of the input data, thereby significantly reducing the run-time cost.
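Both approximations are simple to express in code. The sketch below shows diagonalisation by zeroing off-diagonal terms, and one plausible reading of the determinant matching, in which a single scale factor is applied to a diagonal fixed variance so that its determinant equals that of a diagonal noise variance. The restriction to diagonal noise variances and the function names are assumptions made for the example.

```python
import math

def diagonalise(cov):
    """Keep the diagonal of a full covariance matrix and zero everything else."""
    n = len(cov)
    return [[cov[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]

def scale_fixed_variance(fixed_var, noise_var):
    """Rescale a global diagonal variance so its determinant matches the noise model's.

    Both arguments are lists of per-dimension variances (diagonal matrices).
    """
    n = len(fixed_var)
    # Work with log-determinants to avoid overflow or underflow in high dimensions.
    log_ratio = sum(math.log(v) for v in noise_var) - sum(math.log(v) for v in fixed_var)
    scale = math.exp(log_ratio / n)
    return [scale * v for v in fixed_var]
```

Equal determinants give equal Gaussian normalising constants, which is the property the text asks for when word models and the inter-word noise model compete during continuous recognition.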



3.2. Non-stationary noise

In order to deal with non-stationary noise, a multi-state noise model can be used. In this case, the same PMC method applies but now it is no longer possible to know a priori which noise state to combine with each speech model state. Hence, all combinations must be computed and the optimal sequence decoded at run-time. The standard method of dealing with this is to use a 3-dimensional Viterbi decoding scheme based on the recursion (Moore, 1986)

$$\phi_t(j, v) = \max_{i,u}\, \phi_{t-1}(i, u)\, a_{ij}\, \tilde a_{uv}\, b_{jv}(x_t), \qquad (22)$$

where $\phi_t(j, v)$ is the maximum joint probability of being in state j of the speech model and state v of the noise model at time t, and observing the sequence $x_1$ to $x_t$. The combined output probability $b_{jv}(\cdot)$ in the PMC case corresponds to the distribution obtained by combining state j of the speech model with state v of the noise model. Note that if the clean speech HMMs contain a total of M states and the noise model has N states, then the compensated recogniser will have M x N states. Fortunately, 2 or 3 states are usually sufficient for the noise model.

When the noise model is ergodic (i.e. fully connected), then the above 3-D decoding scheme can be synthesised using a standard 2-D decoder operating on an expanded model whereby each original speech state i is replaced by N compensated states i1, i2, ..., iN with transition probabilities given by

$$\hat a_{iu,jv} = a_{ij}\,\tilde a_{uv}. \qquad (23)$$

Figure 2 illustrates this for the case of a 3 state word model combined with a 2 state noise model. The obvious advantage of this scheme is that it enables standard HMM recognisers to work in non-stationary background noise. Note, however, that there is a small loss of information at the word boundaries since the effective self-loop transition probabilities for the noise states cannot be preserved exactly. In practice, this seems to be of no real significance.
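The expanded model can be illustrated by building its transition matrix as the product of the speech and noise transition probabilities over all state pairs. This is a minimal sketch: it ignores non-emitting entry and exit states and the word-boundary handling mentioned above, and the function name and the example matrices are assumptions.

```python
def expand_transitions(speech_a, noise_a):
    """Transition matrix over (speech state i, noise state u) pairs.

    The pair (i, u) is stored at index i * N + u, and the probability of moving
    to (j, v) is speech_a[i][j] * noise_a[u][v].
    """
    m, n = len(speech_a), len(noise_a)
    expanded = [[0.0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for u in range(n):
            for j in range(m):
                for v in range(n):
                    expanded[i * n + u][j * n + v] = speech_a[i][j] * noise_a[u][v]
    return expanded

# Hypothetical 3 state left-to-right word model and 2 state ergodic noise model
speech_a = [[0.6, 0.4, 0.0],
            [0.0, 0.7, 0.3],
            [0.0, 0.0, 1.0]]
noise_a = [[0.9, 0.1],
           [0.2, 0.8]]
expanded_a = expand_transitions(speech_a, noise_a)
```

The result has 3 x 2 = 6 states, and each row still sums to one because the rows of both input matrices do, so a standard 2-D Viterbi decoder can use it unchanged.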



M.J.F. Gales, S.J. Young / Cepstral parameter compensation

Fig. 2. Constructing a compensated model for non-stationary background noise (panels: Clean Speech Model, Noise Model, Compensated Model).

4. Evaluation on NOISEX-92

In this section, a number of experiments using the NOISEX-92 Database are reported (Varga et al., 1992). This data was pre-processed using a 25 msec Hamming window and a 10 msec frame period. For each frame, a set of 15 MFCC coefficients were computed. The zeroth cepstral coefficient is computed and stored since it is required in the PMC mapping procedure. However, it is subsequently dropped in the actual recognition process.

NOISEX contains one male and one female speaker uttering both isolated digits and digit triples. The test data for each speaker and condition consists of a single file containing all of the test tokens spoken in sequence with a silence interval between each. Here only the male isolated digits were used, of which there are 100 training tokens and 100 test tokens. Five of the eight possible background noises were used: three stationary noises (F16 fighter, Lynx helicopter and a car) and 2 non-stationary noises (machine gun and operations room). For each case, the noise is mixed with the clean speech at 5 levels in the range +18 dB to -6 dB.

For each digit, a single mixture continuous density HMM with 8 emitting states was trained using the clean speech data only. The topology for all models was left-right with no skips and diagonal covariance matrices were assumed throughout. For each test condition, a one state noise HMM was trained using the silence intervals of the test files with, except where stated, a full covariance matrix. Recognition used a standard connected-word Viterbi decoder constrained by a syntax consisting of silence followed by a digit in a loop. Thus, no explicit end-point detector was used and insertion/deletion errors occurred as well as classification errors. The results are in terms of % accuracy, where for N tokens, S substitution errors, D deletion errors and I insertion errors, accuracy is calculated as [(N - S - D - I)/N] x 100%. The error counts themselves were calculated by using a DP string matching algorithm between the recognised digit sequence and the reference transcription. Since the NOISEX data is synthetic, the gain matching term g can be set exactly. Hence, for all the experiments reported here g = 1 was used. Note, however, that in practice, PMC is not sensitive to the exact choice of g. All of the training and testing used version 1.4 of the portable HTK HMM Toolkit (Young, 1992), with suitable extensions to perform the PMC.
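The scoring procedure can be sketched as a small dynamic programming alignment followed by the accuracy formula. The sketch assumes unit costs for substitutions, deletions and insertions; the toolkit used in the paper may weight them differently, and the function names are hypothetical.

```python
def align_counts(reference, recognised):
    """Minimum-edit alignment; returns (substitutions, deletions, insertions)."""
    rows, cols = len(reference) + 1, len(recognised) + 1
    # Each cell holds (total errors, S, D, I) for the best alignment of the prefixes.
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for r in range(1, rows):
        table[r][0] = (r, 0, r, 0)  # every reference token deleted
    for c in range(1, cols):
        table[0][c] = (c, 0, 0, c)  # every recognised token inserted
    for r in range(1, rows):
        for c in range(1, cols):
            mismatch = 0 if reference[r - 1] == recognised[c - 1] else 1
            t, s, d, i = table[r - 1][c - 1]
            diagonal = (t + mismatch, s + mismatch, d, i)
            t, s, d, i = table[r - 1][c]
            deletion = (t + 1, s, d + 1, i)
            t, s, d, i = table[r][c - 1]
            insertion = (t + 1, s, d, i + 1)
            table[r][c] = min(diagonal, deletion, insertion, key=lambda cell: cell[0])
    return table[rows - 1][cols - 1][1:]

def accuracy(n, s, d, i):
    """Percentage accuracy [(N - S - D - I) / N] x 100."""
    return 100.0 * (n - s - d - i) / n
```

Because insertions are subtracted as well, the accuracy figure can fall below the plain percentage of correctly recognised tokens, which matters here since no explicit end-point detector is used.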

Figure 3 shows the performance of the PMC compensated HMMs compared to the standard uncompensated HMMs. In this case, the compensated models use full covariance matrices. As can be seen, the compensation is effective down to at least 0 dB SNR. Table 1 shows that the effect of dropping the off-diagonal terms from the compensated covariance matrices to restore the run-time recogniser to using diagonal variances is minimal. It also shows that good performance can also be obtained from the scaled fixed variance method although this is not quite as good at very low SNRs. Note that without the scaling, the performance at 0 dB and below worsens considerably. For example, the accuracy for the F16 noise at 0 dB drops from 87% to 51% when the scaling is removed, and at -6 dB it drops from 38% to 21%.

Fig. 3. PMC versus baseline HMM recogniser for male isolated digits in Lynx, F16 and Car noise (horizontal axis: SNR in dB).

Table 1
Comparison of full covariance matrices (Full) with diagonal approximation (Diag) and scaled fixed variance (Fixed)

SNR (dB)    Lynx                  F16                   Car
            Full  Diag  Fixed     Full  Diag  Fixed     Full  Diag  Fixed
+18          98    99    100       99   100    100       98    99    100

A related issue to covariance approximation is the question of whether diagonal covariances are adequate for the noise models. Table 2 compares recognition performance for the three stationary noise sources for full covariance and diagonal covariance noise models. In both cases, a diagonal covariance was used with the compensated models. The results show that for these noise sources at least, full covariance models are unnecessary.

Figure 4 shows the performance of PMC with non-stationary background noise. For these noises, more than 1 noise state is essential. For the operations room noise, there is a very small further improvement using a 4 state model compared to a 2 state model, but for machine gun noise the performance is unchanged for 2 or more noise states.

Table 2
Comparison of full covariance noise model (FullN) with diagonal covariance noise model (DiagN)

Fig. 4. Effect of number of noise states for modelling non-stationary noise.



5. Conclusions 

This paper has discussed the use of Parallel Model Combination (PMC) parameter compensation for transforming a set of HMM word models trained on clean data into a set of models which can be used under a specific set of noise conditions.

The PMC approach has been evaluated on the synthetic NOISEX-92 database and shown to give a significant improvement to the noise robustness of an HMM-based recogniser. Two practical aspects of PMC have been discussed and evaluated. Firstly, simple diagonalisation of the compensated full covariance matrices by zeroing off-diagonal terms shows no significant loss in performance. Also, where a single global variance must be used, then a scaled fixed variance scheme gives adequate performance. Secondly, for non-stationary noise, an ergodic noise model with 2 or more noise states is both necessary and effective.

The performance of an HMM recogniser for very high SNRs is not affected by PMC compensation. The use of delta (difference) coefficients would further improve the results reported here for SNRs of +6 dB and better. However, the use of uncompensated delta coefficients at low SNRs would seriously damage performance. An effective compensation scheme for delta coefficients is therefore also needed and work is in progress in this area.

The overall conclusion is that PMC is a very 
simple yet effective approach to dealing with noise 
in an HMM based system. For relatively static 
noise conditions, a once-only adjustment of the 
HMM means and variances is all that is required, 
and hence the run-time cost is zero. In practice, it 
might be expected that a new noise model and 
consequent set of parameter adjustments would 
be recomputed periodically when the recogniser 
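was idle. For more rapidly changing noise, multiple state noise models must be used. These require the original speech model states to be duplicated to include all combinations of noise and speech state, but in practice the duplication factor can be kept very small.

Appendix A. Derivation of log-normal expectations

The expectation of $e^{x_i}$ is defined in Section 2.1 as

$$E[e^{x_i}] = \int_{\mathcal{R}^n} K \exp\left(x_i - \tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)dx$$

$$= K\int_{\mathcal{R}^n} \exp\left(x_i - \tfrac{1}{2}x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu\right)dx, \qquad (24)$$

where K is the usual normalising constant and $\mu = E[x]$. This integral may be evaluated as follows. Let

$$y = \mu + \Sigma e^i, \qquad (25)$$

where $e^i$ is a unit vector with the i-th component unity and all other components zero. It has the property

$$x_i = x^T e^i. \qquad (26)$$

From the definition of y,

$$\exp\left(-\tfrac{1}{2}(x-y)^T\Sigma^{-1}(x-y)\right) = \exp\left(-\tfrac{1}{2}x^T\Sigma^{-1}x + x^T\Sigma^{-1}y - \tfrac{1}{2}y^T\Sigma^{-1}y\right)$$

$$= \exp\left(x_i - \tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) - \mu_i - \tfrac{1}{2}\Sigma_{ii}\right). \qquad (27)$$

Hence,

$$E[e^{x_i}] = K\int_{\mathcal{R}^n} \exp\left(-\tfrac{1}{2}(x-y)^T\Sigma^{-1}(x-y) + \mu_i + \tfrac{1}{2}\Sigma_{ii}\right)dx = e^{\mu_i + \Sigma_{ii}/2}\left\{K\int_{\mathcal{R}^n} \exp\left(-\tfrac{1}{2}(x-y)^T\Sigma^{-1}(x-y)\right)dx\right\}. \qquad (28)$$

The integrand of the term in braces is a simple Gaussian and the integral is therefore unity, hence the required result follows:

$$E[e^{x_i}] = e^{\mu_i + \Sigma_{ii}/2}. \qquad (29)$$

The expectation of $e^{x_i}e^{x_j}$ follows in an identical way, but this time the substitution

$$y = \mu + \Sigma(e^i + e^j) \qquad (30)$$

is used leading to the result

$$E[e^{x_i}e^{x_j}] = \mu_i\mu_j e^{\Sigma_{ij}}, \qquad (31)$$

where, as in Section 2.1, $\mu_i$ and $\mu_j$ on the right-hand side denote the linear domain means $E[e^{x_i}]$ and $E[e^{x_j}]$.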
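The closed form for the log-normal mean can be checked numerically in one dimension. The sketch below integrates $e^x$ against a Gaussian density with the trapezoidal rule and compares the result with $e^{\mu + \sigma^2/2}$. The integration range, the step count and the function name are arbitrary choices made for this illustration.

```python
import math

def lognormal_mean_numeric(mu, var, steps=20000, width=12.0):
    """Trapezoidal estimate of E[exp(x)] for x ~ N(mu, var)."""
    sigma = math.sqrt(var)
    lower, upper = mu - width * sigma, mu + width * sigma
    h = (upper - lower) / steps
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = 0.0
    for k in range(steps + 1):
        x = lower + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # end points count half
        total += weight * norm * math.exp(x - 0.5 * (x - mu) ** 2 / var)
    return total * h

numeric = lognormal_mean_numeric(0.3, 0.5)
closed_form = math.exp(0.3 + 0.5 / 2.0)
```

The two values agree to many decimal places, which is a useful guard against sign or factor-of-two slips when implementing the mapping between the log and linear domains.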
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